INTERNATIONAL PATENT CLASS: G06F-017/30

...SPECIFICATION proportional to the product of the length of the two 
sequences to compare. 

The Wilbur-Lipman algorithm compares contiguous tuples of small
length in the original and reference strings. Tuples are matched for both
sequences using a look-up table that is created from the reference
string. The score for each candidate match is computed and the best
score is selected. A new look-up table is therefore created each time a
new reference sequence must be compared against the database. Since the
entire set of original strings must be checked against the look-up table,
the amount of computation required to match against a database
containing a total of 2N nucleotides or amino acids will be double that
required for a database with only N nucleotides or amino acids. In
other words the number of comparisons against the look... votes in the EIT
that different original strings receive when compared to a given 
reference string, a degree of similarity between each original string 
and the reference string can be established. The original strings 
receiving a higher number of votes in the EIT (after all reference
tuples are compared) are more similar to the reference string than
original strings receiving a lower number of votes.
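
The voting idea described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes contiguous 2-character tuples, a plain dictionary as the look-up table, and a Counter standing in for the EIT. The helper names (tuples, build_lookup, vote) and the sample strings other than HOTEL and SOLID are hypothetical, and the sketch does not attempt to reproduce the match count quoted in the patent's own example.

```python
from collections import Counter, defaultdict

def tuples(s, k=2):
    """Return the contiguous k-character tuples of s, in order."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def build_lookup(originals, k=2):
    """Map each tuple to the ids of the original strings that contain it."""
    table = defaultdict(set)
    for sid, text in originals.items():
        for t in tuples(text, k):
            table[t].add(sid)
    return table

def vote(reference, table, k=2):
    """Give an original string one vote per reference tuple it contains."""
    votes = Counter()
    for t in tuples(reference, k):
        # .get avoids creating empty cells for tuples that are not indexed
        for sid in table.get(t, ()):
            votes[sid] += 1
    return votes

# Hypothetical database of original strings keyed by an id.
originals = {"a": "HOTEL", "b": "MOTEL", "c": "SOLID"}
table = build_lookup(originals)
votes = vote("HOTELS", table)
ranking = [sid for sid, _ in votes.most_common()]
```

Original strings with more votes rank first in `ranking`; a string that shares no tuple with the reference simply receives no votes.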

In summary, after the exactly and similarly matching original strings
have been determined, they are located in the database. Refer to boxes
60 and 65 in Figure 1. To do this, the cells in the EIT...

...indexes are not generated from the tuples. Information records are
placed in cells in the look-up table. For simplicity, the information
record only includes a pointer to the starting location of the original
string, HOTEL, in the database.

Comparing each reference tuple (index) to the original tuples
results in 9 matches. These results are placed in the EIT. In this
example, 9 matches show a high degree of correlation. A reference
string like "SOLID" would have no matches.
This gives a total of 9 matches on...

TECHNIQUES FOR PERFORMING A DATA QUERY IN A COMPUTER SYSTEM

TECHNIQUES D'EXECUTION D'UNE DEMANDE DE DONNEES DANS UN SYSTEME 
INFORMATIQUE
Publication Language: English
Fulltext Word Count: 49717

Patent and Priority Information (Country, Number, Date) : 

Patent: ... 20001005 

Main International Patent Class: G06F-017/10 
International Patent Class: G06F-005/14 ... 
Fulltext Availability:
Detailed Description
Publication Year: 2000 

Detailed Description 

...current update entry. At step 1066, a score is computed for each name
comparison of the existing database entry with a record of the updated
version of the database. The score is computed as one point per
matching component. At step 1068, control returns to...

...of Figure 49 attempt to formulate a numeric quantity or metric for
determining whether two name entries match. This weighted value or
concatenation is used in further comparison in combination with other
fields, such as the zip code, to arrive at a final quantity for
determining whether or not the name fields of an existing database entry
and an update record match.

Referring now to Figure 50, shown is a flow chart of...

...normalized metric or score based on the name field and the zip code. At
step 1080, the score previously derived from the name match for each
entry is incremented by one if the zip code of an existing database entry
matches that of an updated entry. At step 1082 this score is normalized by
taking the score computed...
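
The scoring steps quoted above (one point per matching name component, plus one if the zip codes match) can be sketched as follows. This is an illustration under stated assumptions: name components are taken to be whitespace-separated words compared case-insensitively, and the record layout and function names are hypothetical. The normalization at step 1082 is elided in the excerpt, so it is not attempted here.

```python
def name_score(existing_name, update_name):
    """One point for each distinct name component present in both names."""
    a = set(existing_name.lower().split())
    b = set(update_name.lower().split())
    return len(a & b)

def record_score(existing, update):
    """Name score, incremented by one when the zip codes agree."""
    score = name_score(existing["name"], update["name"])
    if existing["zip"] == update["zip"]:
        score += 1
    return score

# Hypothetical records for illustration only.
existing = {"name": "John Q Smith", "zip": "44325"}
update_same_zip = {"name": "john smith", "zip": "44325"}
update_other_zip = {"name": "john smith", "zip": "10001"}

same_zip_score = record_score(existing, update_same_zip)
other_zip_score = record_score(existing, update_other_zip)
```

Here two name components agree in both cases, so the only difference between the two scores is the zip code point.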
EXTRACTION OF VENDOR INFORMATION FROM WEB SITES

EXTRACTION D'INFORMATIONS DE SITES WEB CONCERNANT DES VENDEURS
(Residence), US (Nationality)
Publication Language: English
Fulltext Word Count: 15500

Patent and Priority Information (Country, Number, Date) : 
Main International Patent Class: G06F-017/30
Fulltext Availability: 

Detailed Description 
Publication Year: 2000 

Detailed Description 

...an empirical rating of the quality of the match represented by the
entry. The three entries in Table 11 correspond to matches of the
tokens "downtown" and "ants" with the last two rows of...

...three matches relate to the vendor identified by CompanyId "222."
However, in a live, functioning IMMM vendor database, hundreds of
matches related to many different companies may be detected and entered
into the relational table "tempdb.dbo.TokenMatch." The merging tool
next computes a cumulative total score for each CompanyId having entries
in the relational table "tempdb.dbo.TokenMatch" and enters that total
score, along with the corresponding CompanyId, into the relational
table "tempdb.dbo.TokenSummary," shown below in Table 12.

Table 12

tempdb.dbo.TokenSummary
CompanyId    TotalScore
222          70
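
The summarizing step that produces a table like Table 12 amounts to summing match scores per CompanyId. The sketch below assumes each token match is a (CompanyId, score) pair. The per-token scores and the second company id are invented for illustration; they are chosen only so that company 222 totals 70 as in Table 12, and they are not taken from the patent.

```python
from collections import defaultdict

def summarize(token_matches):
    """Sum the match scores for each company id."""
    totals = defaultdict(int)
    for company_id, score in token_matches:
        totals[company_id] += score
    return dict(totals)

# Hypothetical token-match rows: (CompanyId, score).
rows = [("222", 30), ("222", 25), ("222", 15), ("305", 10)]
summary = summarize(rows)
best_company = max(summary, key=summary.get)
```

The company with the highest total is the natural candidate for the subsequent field-by-field comparison.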

Again, in general, relational table "tempdb.dbo.TokenSummary" may
contain tens or hundreds of different entries following the analysis of a
single record from the relational table "InterDB1" by the merging tool.
The merging tool selects the CompanyId or the information in the
selected row of the relational table "InterDB1" with the row in the
relational table "Company" identified, via token matching, as
describing the vendor represented by the selected row of the relational
table "InterDB1." This field-by-field comparison is facilitated by
the entry in the relational table "CompanyMerge" having the same value
for the field...
SYSTEM AND METHOD FOR INDEXING INFORMATION ABOUT ENTITIES FROM DIFFERENT
INFORMATION SOURCES

SYSTEME ET PROCEDE PERMETTANT L'INDEXAGE D'INFORMATIONS RELATIVES A DES
ENTITES PROVENANT DE SOURCES D'INFORMATION DIFFERENTES

Patent Applicant/Assignee:
Publication Language: English

Fulltext Word Count: 11692
Main International Patent Class: G06F-017/30
International Patent Class: G06F-15:40 ...

...G06F-17:60

Fulltext Availability: 

Claims 
Publication Year: 1998 

Claim 

1 A system for associating a data record from an information source into
a database, the database containing a plurality of data records, the
system comprising: means for receiving a data record from an...

...data record having a predetermined number of fields containing
information about a particular entity;

means for comparing selected fields within the received data record
with corresponding fields within the data records already in the database;

means, responsive to comparison, for identifying data records already in
the database having data within some of the selected fields that match
to the data in the fields of the received data record as possible
matching candidates; and

means for scoring the identified matching candidates using a
predetermined scoring criteria which measures a likelihood of a
match between the received data record and the data records in the
database to determine if the received data record and a data record in
the database contains information about the same entity thereby
associating data records about the same entity despite errors contained
...for updating the rules database with additional rules.

15 The system of Claim 14, wherein said rule database updating means
comprises means for comparing a new rule with a rule already in the rules
database and means for synthesizing the data records associated with
the new rule with the data records associated with a previous rule in the
rules database.

16 A method for associating a data record from an information source into
a database, the database containing a plurality of data records, the
method comprising: receiving a data record from an information source,
the received data record having a predetermined number of fields
containing information about a particular entity;

comparing selected fields within the received data record with
corresponding fields within the data records already in the database;

identifying data records already in the database, based on the
comparison, having data within some of the selected fields that match to
the data in the fields of the received data record as possible matching
candidates; and

scoring the identified matching candidates using a predetermined
scoring criteria which measures a likelihood of a match between the
received data record and the data records in the database to determine
if the received data record and a data record in the database contains
information about the same entity thereby associating data records about
the same entity despite errors contained...
ANALYSE COMPARATIVE DE PRODUITS DE TRANSCRIPTION GENIQUES
Detailed Description
Publication Year: 1995 

Detailed Description 
. . . AND DRAWINGS 
TABLES

Table 1 presents a detailed explanation of the letter 
codes utilized in Tables 2 

Table 2 lists the one hundred most common gene
transcripts. It is a partial list of isolates from the
HUVEC cDNA library prepared and sequenced as described
below. The left-hand column refers to the sequence's order
of abundance in this table. The next column, labeled
"number," is the clone number of the first HUVEC sequence
identification reference matching the sequence in the
"entry" column number. Isolates that have not been
sequenced are not present in Table 2. The next column,
labeled "N", indicates the total number of cDNAs which have
the same degree of match with the sequence of the reference
transcript in the "entry" column.

The column labeled "entry" gives the NIH GENBANK locus
name, which corresponds to the library sequence numbers.

The "s" column indicates in a few cases the species of the
reference sequence. The code for column "s" is given in
Table 1. The column labeled "descriptor" provides a plain
English explanation of the identity of the sequence
corresponding to the NIH GENBANK locus name in the "entry"
column.

Table 3 is a comparison of the top fifteen most
abundant gene transcripts in normal monocytes and activated
macrophage cells.

Table 4 is a detailed summary of library subtraction
analysis comparing the THP-1 and human macrophage
cDNA sequences. In Table 4, the same...
Data entry comparison method for database system, involves comparing
data field definitions to preset patterns to determine whether data
entries are matching or unmatching
Abstract (Basic): US 20040267743 A1

NOVELTY - The matching percentage scores corresponding to the
data fields of each data entry are combined to produce a composite
score. The fields are defined as matching, unmatching or
unclassified using the composite score. The field definitions are
compared to preset patterns to determine whether entries are matching
or unmatching. The final comparison result is determined based on the
composite score and the determination result.

DETAILED DESCRIPTION - An INDEPENDENT CLAIM is also included for a
method of identifying whether a data entry input is duplicative of
existing data entries.

USE - For comparing data entries related to merchant information in a
database system using a computer system connected to the internet, an
intranet and a data network, as used in a transaction card company.

ADVANTAGE - Since the data entries are compared easily, duplicate
entries of merchant information are recognized, and multiple
registrations of a merchant are prevented efficiently.

DESCRIPTION OF DRAWING(S) - The figure shows a flow diagram of the
process performed by the system for identifying duplicate entries in
the database.

duplicate entry identifying system (10) 

pp; 13 DwgNo 1/6 

Title Terms: DATA; ENTER; COMPARE; METHOD; DATABASE; SYSTEM; COMPARE;
DATA; FIELD; DEFINE; PRESET; PATTERN; DETERMINE; DATA; ENTER; MATCH
Derwent Class: T01 
Sales promotion information extraction program involves calculating
similarity between preference information of customer with field 
information of each item for displaying goods with high degree of 
similarity 
Abstract (Basic): JP 2004295624 A

NOVELTY - Preference information corresponding to the input
customer ID, acquired from the customer database (16), is compared
with the field information of each item contained in the goods
database (15). The similarity between the two sets of information is
calculated for displaying goods with a high degree of similarity.

DETAILED DESCRIPTION - An INDEPENDENT CLAIM is also included for 
computer readable recording medium storing sales promotion information 
extraction program. 

USE - For extraction of sales promotion information. 

ADVANTAGE - Performs efficient search with respect to targeted 
goods. 

DESCRIPTION OF DRAWING(S) - The figure shows the block diagram of
the sales promotion information extraction system. (Drawing includes
non-English language text).

sales promotion information extraction system (10) 

goods database (15) 

customer database (16) 

goods characteristic file (23) 

customer characteristic file (24) 

goods characteristic configuration file (26) 
Abstract (Basic): FR 2841672 A1

NOVELTY - Objects contained in data bases (4,5) are entered
into a destination data base (6) and in a first stage classified
according to categories, use and characteristics (12,13,14). In a
second stage all the objects in the first and second catalogues having
the same use are placed in a reconciliation table (16). Finally,
technical characteristics (15) are compared field by field and
analysed (17), and an equivalence level is produced for the
reconciliation table.

USE - To establish levels of equivalence between similar or
identical objects in two or more data bases. Particular application
to equipment catalogues produced by different manufacturers or
suppliers.

ADVANTAGE - The method enables the cost of large scale purchases of
many items to be minimised by identifying equivalent uses and technical
characteristics between products offered by various manufacturers or
suppliers.

DESCRIPTION OF DRAWING(S) - The drawing shows the structure of the
destination data base. (The drawing includes non-English language
text)

Object data bases (4,5) 
Destination data base (6) 

Categories, sub-categories and use (12,13,14)
Technical characteristics (15) 
Reconciliation table (16) 
Technical comparison analysis (17) 
The method of determining contents involves attaching a database
(106) to be mapped. A field identification module (102) is activated. 
The module scans the contents of the data field so as to locate a data 
field identifier. The identifiers are each compared with a known 
list. A comparison score is assessed. The data field is sampled 
according to a pre-selected list of requirements. 

The data field is compared with the requirements and a sampling 
score is assessed. A test case where the data field is used in an 
application programme (100) is constructed. A score is assessed for 
the accuracy of a third comparison between actual and expected 
results of the test case. A field type is chosen based upon a 
cumulative result of the three scores. 

USE / ADVANTAGE - For use with address and barcode printing system. 
Is accurate due to consistent application of decision making model. 
Allows for effective use of time due to determining relevance of 
contents of data field. 
Abstract (Basic): EP 433964 A

The name resolution process begins by determining an exact match
score and initialising count fields. A set is then constructed of
all index entries having first (surname) fields matching the first
(surname) field of the purported word. An initial 'guess' at the
surname component is selected and all index entries are identified
which have either (a) surnames equal to the last word of the purported
name 'surname' field or (b) multiple word surnames whose last word
equals the last word of the purported name 'surname' field. For each of
the names in the set, the first fields of the entry name and the
purported name are compared. If they do not match, another entry name is
selected from the set.

The process then selects a second field from the purported name 
and compares the second fields of the purported and entry names. A 
third (personal name) field is then selected for processing. Finally a 
comparison score is determined for the degree of similarity between 
the purported and entry names so that the best match may be selected. 
ABSTRACT: Address matching is the process of adding locational information
to a database containing business, survey, or administrative records. It 
is a very powerful GIS technology, because it can transform any existing 
database with street addresses into a GIS database that can be mapped or 
used as input into more sophisticated spatial analyses. Address matching 
is inexpensive, as well, since it can be performed on microcomputers with 
low-cost GIS software and the US Census Bureau's TIGER/Line files. 
Although address matching has a number of important limitations, it is 
nevertheless one of the most cost-effective means for applying GIS mapping 
and spatial analysis tools to the country's pressing urban problems. 

TEXT: Over the last 20 years, geographic information systems (GIS) have 
transformed the way in which urban planners conduct environmental analyses, 
register parcel boundaries, map infrastructure location, and model 
transportation systems (Duecker and DeLacey 1990, Huxhold 1991, Harris and 
Batty 1993). Surprisingly, though, GIS technology has had much less effect
on the analysis of the human activity patterns that underlie the country's
most pressing urban problems, including poverty, crime, education, teenage
pregnancy, public health, and unemployment.(1)

One of the major reasons for this relative neglect is the difficulty of 
generating accurate, timely, and inexpensive locational information for 
human activities. Recently, however, the use of low-cost GIS software for 
generating such data has become a realistic option. Address matching (also
known as geocoding(2)) is a very powerful GIS technology: it can convert any
administrative, survey, or business database with street addresses into a 
GIS database containing locational information. The resulting database can 
then be either displayed as a pin map, aggregated into regions and 
displayed as a thematic map, combined with U.S. census information and 
other available GIS data, or used as input into the full range of advanced 
GIS procedures for spatial analysis. 

Table 1 lists, in the first column, typical public sector administrative 
databases that are available for most urban areas. In column 2 the table 
shows the range of potential GIS applications that can be performed once 
the original database has been address matched.(3) (Table 1 omitted)

Address matching technology has been available in some form for more than 
thirty years, first as stand-alone software, and now embedded within GIS 
packages. It was initially developed by the U.S. Bureau of the Census for a 
specific purpose: to allocate population accurately within blocks, census 
tracts, and other geographic areas, without the expense of sending 
enumerators to every dwelling unit in the country (U.S. Bureau of the 
Census 1970; Marx 1990). In 1970 the Census Bureau used computerized files 
of street names and address ranges in order to assign geographic codes 
within 145 urban areas. For the 1980 census, the Bureau added spatial 
information (longitude and latitude) and a number of nonstreet features, to 
create the GBF/DIME files. For 1990 the Bureau expanded the spatial 
database to cover the entire country, greatly increased the number of
address ranges, and developed a more sophisticated data structure,
producing the TIGER system of databases.

Databases and software for address matching have been available from the 
Bureau of the Census for 25 years (U.S. Bureau of the Census 1970), but 
low-cost, microcomputer-based address matching has become practical only 
recently. This achievement is due to three factors: the continuously 
increasing power and the decreasing cost of microcomputer hardware, the 
growing sophistication of desktop GIS software, and the recent availability
of the Census Bureau's TIGER/Line files on CD-ROM. The current version of 
the TIGER/Line files includes address ranges for areas encompassing over 80 
percent of the United States population (U.S. Bureau of the Census 1993b). 
Any agency, business, or organization within these areas can now begin 
address matching for an initial investment of less than $5,000, including 
microcomputer hardware, GIS software, and the appropriate TIGER/Line
databases.

This article provides an overview of address matching technology so that 
urban planners can have a realistic understanding of the capabilities, 
limitations, and costs of this newly available microcomputer-based 
technology. The next five sections of the article will answer these basic 
questions: (1) what is address matching? (2) what are the data and 
technology requirements for address matching? (3) what are the sources and 
consequences of address match errors? (4) what software tools are available 
to improve address match results? and (5) what role can urban planners play 
in the larger community of address match practitioners? 

Basics of Address Matching 

Address matching can be defined, most generally, as the process of linking 
records in two databases, based upon street address (modified from U.S.
Bureau of the Census 1970, 18). The first of these databases, the reference
database, contains both address information and locational information. 
Usually the reference database contains one record for every street 
segment, that is, the portion of a street between one intersecting street
and the next intersecting street.(4) This record includes fields for
directional prefix, street name, street type, and directional suffix. Four 
additional fields store the low and high addresses on the segment's left 
side, and the low and high addresses on its right side. Information on the 
latitude and longitude of the segment endpoints is also stored, as is 
optional information on the location of intermediate points (or shape 
points) between the two endpoints. Other optional fields may store the zip 
code, city name, or census tract number of each side of the street segment. 
Currently, the most common source of reference databases is the U.S. Census
Bureau's TIGER/Line files (U.S. Bureau of the Census 1993b), which will be
discussed in more detail below. 

The second required database is the target database. It contains address 
information plus additional database fields characterizing the event that 
took place at the address or was related to the address. (The first column 
of Table 1 lists many of the target databases most frequently used in urban 
areas.) The address matching software attempts to identify a street segment 
record in the reference database that has the same street name, street 
type, and other identifiers as the record in the target database. After the 
two records have been matched, interpolation is used to assign geographic 
coordinates to the target database record, and other locational information 
(such as census tract number) can be copied from the reference database 
record to the target database record. This matching process is a 
specialized example of the way in which a relational database management 
system can join two databases that share a common field or set of fields. 
Figure 1 shows example records from both reference and target databases, 
and illustrates the typical outcomes produced by the matching 
process. (Figure 1 omitted)
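
The join between a target address and the reference database can be sketched as follows. This is a simplified illustration, not the behavior of any particular GIS package: the Segment fields mirror those described above but omit the directional suffix and coordinates, address ranges are assumed to hold a single parity per side, and the left-side range in the sample record is hypothetical (the right-side range 302 to 308 is the one used in the Figure 2 example).

```python
from dataclasses import dataclass

@dataclass
class Segment:
    prefix: str        # directional prefix, e.g. "N"
    name: str          # street name
    street_type: str   # e.g. "LN", "ST"
    left_low: int
    left_high: int
    right_low: int
    right_high: int

def in_range(number, low, high):
    """True if number lies within the range and shares its parity."""
    lo, hi = min(low, high), max(low, high)
    return lo <= number <= hi and number % 2 == low % 2

def find_segments(number, prefix, name, street_type, segments):
    """Return (segment, side) pairs whose address fields and range match."""
    hits = []
    for seg in segments:
        if (seg.prefix, seg.name, seg.street_type) != (prefix, name, street_type):
            continue
        if in_range(number, seg.left_low, seg.left_high):
            hits.append((seg, "left"))
        elif in_range(number, seg.right_low, seg.right_high):
            hits.append((seg, "right"))
    return hits

# Hypothetical reference database with two segments.
reference = [
    Segment("N", "LEISURE", "LN", 301, 307, 302, 308),
    Segment("S", "LEISURE", "LN", 101, 199, 100, 198),
]
hits = find_segments(306, "N", "LEISURE", "LN", reference)
```

A single hit corresponds to a perfect match; an empty list means the record needs partial matching or manual review.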

The first and second records of the target database each generate a perfect
match with a single record in the reference database, since the street
name, street type, and other address fields of the two records are
identical, and the target address number is contained within the reference
record's address range. The third record achieves a partial match, since
most of the fields are identical but there is a difference in street type.
A partial match can result when no perfect match is available but there are 
substantial similarities between a target record's address and the address 
information in one or more reference database records. Partial matches will 
be discussed in considerable detail below. The fourth target record 
generates partial matches with two different reference records because of 
the omission of the directional prefix ("N" or "S" for north or south) in 
the target record. The fifth and sixth target records cannot currently be 
matched. There appear to be no possible matches for the fifth record, but 
in the sixth target record perhaps "Maine St." is a misspelling of "Main 
St." The sixth record is an example of a potential match, an unmatched 
record that could be matched if the analyst were willing to adopt less 
stringent match criteria. 
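
The three outcomes described above (perfect, partial, and potential matches) can be sketched as a small classifier. The rules here are assumptions made for illustration: a partial match is taken to mean that the street name agrees and exactly one other field differs, and a potential match is taken to mean that only the street name differs while remaining similar under the standard-library difflib.SequenceMatcher ratio. The 0.8 cutoff and the field names are hypothetical, and address-range checking is left out.

```python
from difflib import SequenceMatcher

FIELDS = ("prefix", "name", "street_type")

def classify(target, reference, name_cutoff=0.8):
    """Label a target/reference pair as perfect, partial, potential, or none."""
    differing = [f for f in FIELDS if target[f] != reference[f]]
    if not differing:
        return "perfect"
    if "name" not in differing and len(differing) == 1:
        return "partial"
    if differing == ["name"]:
        ratio = SequenceMatcher(None, target["name"], reference["name"]).ratio()
        if ratio >= name_cutoff:
            return "potential"
    return "none"

main_st = {"prefix": "", "name": "MAIN", "street_type": "ST"}

same = classify({"prefix": "", "name": "MAIN", "street_type": "ST"}, main_st)
wrong_type = classify({"prefix": "", "name": "MAIN", "street_type": "AVE"}, main_st)
misspelled = classify({"prefix": "", "name": "MAINE", "street_type": "ST"}, main_st)
unrelated = classify({"prefix": "", "name": "ELM", "street_type": "ST"}, main_st)
```

Lowering `name_cutoff` corresponds to the analyst adopting less stringent match criteria, at the cost of more false matches.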

Once the target address has been matched to a single street segment in the 
reference database, the address matching software checks the parity 
(even/odd designation) of the target address and determines whether it 
falls on the left or right side of the street. The software then 
interpolates to determine the address's approximate distance along the 
street segment by using the low and high addresses of the block face, as 
shown in Figure 2. (Figure 2 omitted) 

The example in Figure 2 shows that street number 306 lies about two-thirds 
of the way between the low address on the right (302) and the high address 
on the right (308). Therefore, by interpolation, the address is assumed to 
lie two-thirds of the way ([306-302] / [308-302] = 0.667) between the two
end points of the street segment. A user-specified offset (e.g., 25 feet) 
may then be applied to move the location a set distance away from the 
street centerline. To provide additional control on the correctness of the 
match, the GIS may compare the city name or zip code in the target record 
with similar fields in the reference record. This use of check-fields 
allows the process to distinguish between multiple instances of "Main 
Street" in the same county or region. Finally, the x-y location of the 
address is preserved by writing it into the target database, or 
user-selected geographic codes (such as census tract numbers) are copied 
from the reference database into the target database. It is important to 
emphasize that the interpolation process, by nature, can produce only an 
approximate location for each target record address. In Figure 2, for 
example, the interpolated position for 304 N. Leisure Lane would be 
one-third of the way along the street. However, its actual position is 
located in the middle of the block, since the building at number 302 is on 
a large lot. 
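
The interpolation and offset steps can be sketched as follows. The sketch assumes planar x-y coordinates in feet (not latitude and longitude), a straight segment with no shape points, and hypothetical endpoint coordinates; the function names are illustrative. Only the address numbers 302, 304, 306, and 308 and the 25-foot offset come from the text.

```python
import math

def interpolate(number, low, high, start, end):
    """Place an address along a segment in proportion to its house number."""
    if high == low:
        fraction = 0.5  # single-address range: assume the midpoint
    else:
        fraction = (number - low) / (high - low)
    x = start[0] + fraction * (end[0] - start[0])
    y = start[1] + fraction * (end[1] - start[1])
    return fraction, (x, y)

def offset_point(point, start, end, distance, side):
    """Shift a point perpendicular to the centerline, to the left or right."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return point  # degenerate segment: no direction to offset along
    # Unit normal pointing to the left of the start-to-end direction.
    nx, ny = -dy / length, dx / length
    sign = 1 if side == "left" else -1
    return (point[0] + sign * distance * nx, point[1] + sign * distance * ny)

# Hypothetical segment running 300 feet due east from the origin.
start, end = (0.0, 0.0), (300.0, 0.0)
frac_306, pt_306 = interpolate(306, 302, 308, start, end)
frac_304, _ = interpolate(304, 302, 308, start, end)
placed = offset_point(pt_306, start, end, 25.0, "right")
```

As the text cautions, the result is only an estimate: the sketch places number 304 one-third of the way along the segment regardless of actual lot sizes.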

Requirements for Address Matching 

The successful linkage of target records and reference records is the heart 
of address matching. However, this procedure is only one component of the 
larger process, illustrated in Figure 3 as a series of nine steps. (Figure 3 
omitted) 

Steps one, two, and three (on the left side of Figure 3) generate the 
reference database, usually a GIS database of street-segment centerlines 
with address range information for each side of the street segments. The 
most common source for building a reference database is the TIGER/Line 
files developed by the U.S. Bureau of the Census (Klosterman and Lew 1992, 
U.S. Bureau of the Census 1993b). TIGER/Line files for most states are 
available on a single CD-ROM for $250. (See the Appendix for a listing of 
sources for address matching datasets and software.) Although the street 
centerline network covers the entire United States, the files do not (and 
cannot) contain address ranges for rural areas where city-style address 
ranges do not exist, or for areas that have been recently developed. The 
documentation manual for the TIGER/Line files (U.S. Bureau of the Census 
1993b) includes a county-by-county listing of the estimated proportion of 
street segments with address ranges. 

There are three other potential sources for reference databases. First, the 
best available reference databases are those that are continuously updated 
by local, regional, or state government agencies, but they are available 
only in limited areas. Second, several commercial firms (such as Geographic
Data Technology and Etak) have added expanded address ranges to the
TIGER/Line files. In some areas the commercial databases represent a 
significant improvement over the TIGER/Line files; in other areas the 
differences are minimal. Before purchasing one of these products, the user 
should compare the percentage of street segments having TIGER/Line address
ranges in the area of interest to the percentage in the commercial product, 
for a rough measure of the commercial database's added value. (5) 

The third source for reference databases is now the least important, but 
could eventually become the most important. The U.S. Postal Service sells a
set of national street-segment databases that allow organizations with 
large mailing lists to generate the appropriate nine-digit zip codes from 
databases containing only street addresses and five-digit zip codes (U.S. 
Postal Service 1993). In urban areas, the nine-digit zip codes generally
equate to city blocks or even smaller areas. (6) These databases contain 
many of the same fields as do the TIGER/Line files (street name, street 
type, low address left, etc.), and also include the full nine-digit zip 
code for every street segment. Unfortunately, they do not contain any 
geographic or locational information. However, a second Postal Service
database called ZIP/TIGER (U.S. Postal Service 1991) relates nine-digit zip 
codes to the TIGER Census geography, and includes latitudes and longitudes 
of varying accuracy. Because the Postal Service continuously updates these 
databases, over time they will improve in coverage and accuracy. By 
contrast, the TIGER/Line files may not be substantially updated until the 
next census, and will become less and less acceptable as an address match 
reference file. The "Geocoder" package from Strategic Mapping uses the 
Postal Service databases, and in the future many GIS packages may be able 
to incorporate them directly into the address matching process. 

Once the TIGER/Line files or an appropriate substitute have been obtained, 
the database must be converted from the TIGER/Line format to the specific 
internal format required by the user's GIS or stand-alone address match
software. There are multiple releases of the TIGER/Line files, the most 
recent being the 1992 version, and the format of TIGER/Line data has 
changed slightly with each one. The GIS system's import facility must be 
capable of handling the 1992 version, which has significantly expanded 
address ranges. Once the TIGER/Line files have been imported, it is 
beneficial for the GIS to conduct automated consistency checks to identify 
parity errors (even and odd addresses on the same side of the street), 
overlapping address ranges, inconsistent street names, and other 
potentially correctable errors. These automated consistency checks are 
found in UNIX workstation GIS systems such as ARC/INFO, but are not yet 
widely available on microcomputer GIS packages. 
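
As a rough illustration of such consistency checks, the following Python sketch flags two of the problems named above: parity errors and overlapping address ranges. The SideRange record layout and both function names are assumptions made for this example; they do not reproduce the TIGER/Line record format or any vendor's import facility.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SideRange:
    """Address range for one side of one street segment (hypothetical layout)."""
    segment_id: str
    street: str
    side: str   # "L" or "R"
    low: int
    high: int


def has_parity_error(r: SideRange) -> bool:
    """True when the low and high numbers on one side differ in parity."""
    return r.low % 2 != r.high % 2


def overlapping_ranges(ranges: List[SideRange]) -> List[Tuple[str, str]]:
    """Pairs of segment ids on the same street and parity whose ranges overlap."""
    problems = []
    for i, a in enumerate(ranges):
        for b in ranges[i + 1:]:
            a_lo, a_hi = sorted((a.low, a.high))
            b_lo, b_hi = sorted((b.low, b.high))
            same_parity = a.low % 2 == b.low % 2
            if a.street == b.street and same_parity and a_lo <= b_hi and b_lo <= a_hi:
                problems.append((a.segment_id, b.segment_id))
    return problems


# Made-up sample data for illustration only.
sample = [
    SideRange("s1", "MAIN ST", "L", 101, 199),
    SideRange("s2", "MAIN ST", "L", 151, 249),   # overlaps s1
    SideRange("s3", "MAIN ST", "R", 100, 199),   # even low, odd high
    SideRange("s4", "MAIN ST", "R", 200, 298),
]
parity_problems = [r.segment_id for r in sample if has_parity_error(r)]
overlap_problems = overlapping_ranges(sample)
```

Flagged records would still need review by the analyst, since some out-of-sequence ranges are real features of the street network rather than data errors.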

Steps four, five, and six (on the right side of Figure 3) must be 
accomplished for each target database that will be address matched. The 
target records may be stored in any of a huge variety of internal database 
formats, but in virtually every case it will be possible to convert the 
records into an ASCII file with fixed field widths or comma-delimited 
fields, then import the ASCII file into the GIS database format. Most 
microcomputer GIS packages store much of their own data as dBase DBF files, 
making it particularly easy to address match databases already in that 
format.

The target databases, unfortunately, generally store address information in 
different styles, and even within the same database the address data will 
be inconsistently formatted. Apartment numbers, building numbers, street 
abbreviations, and street spellings are the most common sources of 
inconsistency. The GIS or matching software should standardize the address 
elements, drop information that is irrelevant for address matching (such as 
apartment, floor, and building numbers) and convert addresses to a uniform 
format. This will allow the most records to be matched accurately without 
user intervention. 
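
A minimal sketch of this kind of clean-up, assuming a short keyword list for apartment, floor, and building designators, might look like the following. The pattern and the function name normalize_for_matching are illustrative assumptions, not the behavior of any particular GIS package.

```python
import re

# Hypothetical keyword list; a production table would be longer.
# The unit identifier must contain a digit or be a single letter, so that
# a street actually named "Floor St" is not truncated.
UNIT_PATTERN = re.compile(
    r"\s+(?:APT|APARTMENT|UNIT|STE|SUITE|FL|FLOOR|BLDG|BUILDING|#)"
    r"\s*(?:[A-Z]?\d+[A-Z]?|[A-Z])$"
)


def normalize_for_matching(address: str) -> str:
    """Upper-case, strip commas and periods, and drop a trailing unit designator."""
    text = re.sub(r"[.,]", " ", address.upper())
    text = " ".join(text.split())          # collapse runs of whitespace
    return UNIT_PATTERN.sub("", text)


cleaned = normalize_for_matching("123 Main St., Apt. 4B")
```

The sketch only removes a designator at the end of the address; a leading designator such as "Apt 3 N Decatur Road" is left untouched and would still require interactive review.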

Once both the reference and target databases have been prepared, the actual 
matching process can begin. Stand-alone address match packages and 
GIS-based address match tools usually allow for both batch and interactive 
matching. During batch matching the software matches as many records as
possible without user intervention (step 7). If only a small number of
unmatched records remain, the residual records can be interactively matched 
(step 8). Interactive matching displays each unmatched record for the user 
to edit, in order to correct misspellings, expand abbreviations, or delete 
extraneous information that has prevented a correct match. The software may 
also present the street-segment records that are very similar to the
unmatched target record, or allow the user to generate x and y coordinates 
for the record by pointing to a location on the screen. 

The final step in the process is the application of the full range of GIS 
tools for the display and analysis of the geocoded database. For example, 
either the entire database or any user-specified subset can be displayed as 
a pin map by placing a symbol at the location of each event, or different 
symbols can be used for different types of events. An area thematic map can 
aggregate the number of events within a set of regions (such as police 
precincts or school districts) and shade each region according to the 
average or total number of events in the region. Other types of GIS data, 
such as detailed population information from the U.S. census, can be 
integrated to produce maps showing, for example, teenage fertility rates, 
dropout rates, or crime rates. Address matched data can also be used as an 
input to more sophisticated GIS applications such as optimized facility 
location, school bus routing to minimize miles driven, or the delineation 
of public facility service areas and estimated facility service loads.

Errors in Address Matching 

The key to successful address matching is how well the software and analyst 
jointly respond to unmatched addresses. With a fully accurate street 
network and an error-free target database, address matching is a trivial 
task. But in the real world, errors in both the street network and target 
addresses mean that a 100 percent match rate is possible for only very 
small databases. Typical errors in the street network include missing 
street segments, missing address ranges, and incorrectly ranged streets. In 
the address field of the target database, characteristic problems include 
abbreviated street names ("MLK Dr" for "Martin Luther King Jr. Drive"), 
misspelled street names ("Main Street" for "Maine Street"), complex 
directional combinations ("100 E N St NE"), and confusing or ambiguous 
information ("Apt 3 N Decatur Road").

After any address matching operation, the target records can theoretically 
be subdivided into four result classes. (7) The first class comprises 
correct matches, those records that now contain the proper (approximate) x 
and y coordinates or geographic codes for the target address. The second 
class contains unmatched records that could not be matched under any 
circumstances. That could be due to an address that is missing, mangled, or 
nothing more than a post office box. It could also result from the location 
of the address in a very recent subdivision not included in the reference 
database. The third and fourth result classes describe two fundamental 
types of matching errors. Incorrect matches, or false positives, include 
records that are matched, for some reason, to the incorrect street 
segments. Nonmatches that could have been matched, or false negatives, can 
be due to a small mistake such as the omission of a directional suffix. 

In most cases the analyst will find perfect matches for 25 to 75 percent of 
the target database records. But how should the unmatched records be 
handled? There are three basic options. Strategy A ignores all records that 
are not perfect matches and uses only the perfectly matched records for 
mapping or further analysis. Strategy B enlarges the set of matched records 
by including both perfect matches and the most reliable types of partial 
matches. Strategy C expands the matched record set still further by 
including perfect matches and every possible partial match, of whatever 
reliability. Figure 4 shows the results of applying the three strategies to 
a typical target database. (Figure 4 omitted) 

The lowest section of each stacked bar represents those records that cannot 
achieve any kind of match, either perfect or partial, no matter how relaxed 
the match criteria. The next, white, area of each bar depicts the portion 
of target records that are correctly matched. The third set of bars (with
cross hatching) shows potential matches, that is, records that are
currently treated as unmatched, but could be matched if additional types of
partial matches were allowed. The top, dark, bars show the proportion of 
the target database that has been matched, but incorrectly. 

The number of total matches (correct and incorrect) divided by the total 
number of records is the match rate, while the number of incorrect matches 
divided by the total number of matches is the error rate. In moving from 
strategy A to strategy B to strategy C, the analyst relaxes the match 
requirements and transforms more and more potential matches into correct 
matches and mistaken matches. During this process the match rate increases 
(from 55%, to 80%, to 95%), but the error rate jumps as well (from 9%, to 
12%, to 21%).
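
The two definitions can be written out as a short Python sketch. The function names are assumptions for illustration, and correct_share is an added convenience that combines the two published rates to estimate what fraction of all target records ends up correctly matched.

```python
def match_rate(correct: int, incorrect: int, total_records: int) -> float:
    """Total matches (correct and incorrect) divided by total records."""
    return (correct + incorrect) / total_records


def error_rate(correct: int, incorrect: int) -> float:
    """Incorrect matches divided by total matches."""
    return incorrect / (correct + incorrect)


def correct_share(rate_matched: float, rate_error: float) -> float:
    """Correct matches as a fraction of all target records."""
    return rate_matched * (1 - rate_error)


# (match rate, error rate) for each strategy, as quoted in the text.
strategies = {"A": (0.55, 0.09), "B": (0.80, 0.12), "C": (0.95, 0.21)}
shares = {name: round(correct_share(m, e), 4) for name, (m, e) in strategies.items()}
```

Under these figures the share of all records that is correctly matched rises from about 50 percent under Strategy A to about 70 percent under B and 75 percent under C, while the share that is wrongly matched rises from about 5 percent to about 10 and 20 percent.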

Unfortunately, there can be no absolute rule to determine the proper 
tradeoff between the advantage of matching more records and the 
disadvantage of decreased match reliability. Instead, two general questions 
must guide the analyst. First, how do the benefits of a higher match rate 
compare to the costs of a higher error rate? In most cases, an approach 
like that of Strategy B would be best, since the 25-percentage-point gain in
match rate costs only a 3-point increase in the error rate. But when a GIS
is used for instant dispatch of emergency services, the consequences of a 
high error rate (emergency vehicles sent to the wrong location) would be 
much more severe than those of a lower match rate, so Strategy A would be 
the most appropriate. At the other extreme, during a GIS application to 
route school buses, a high error rate would result in some empty seats on 
some buses, but a low match rate would require many children to stand on 
overcrowded buses. For this application, Strategy C would probably be 
superior. 

The second question the analyst must ask is whether there is reason to 
suspect a systematic bias in the unmatched records, and whether that bias 
could substantially distort the results of the analysis. Matching software 
that cannot recognize apartment numbers may fail to match any address that 
includes an apartment number. This means that the set of successful matches 
will exclude many apartment dwellers and is thereby likely to 
underrepresent lower-income residents. In another example, any matching
process that uses the unmodified TIGER files will fail to locate addresses 
in recently developed areas. This will introduce a bias against new homes, 
and a spatial bias against rapidly developing areas. In any single case, 
these sources of bias may or may not be important, depending on the type of 
target database being used and the fundamental purpose of the analysis 
(Drummond 1993).

A final caution concerns the spatial limitations of address matching. Any 
location generated through address matching will be approximate and 
necessarily less accurate than the original street-segment database. 
TIGER/Line files, for example, were digitized from the Census Bureau's 
original Address Coding Guides and U.S.G.S. 1:100,000 maps (Marx 1990). In 
general, well-defined points in the TIGER/Line files will be placed within 
167 feet of their true location on the ground. Considering the additional 
uncertainties introduced by location interpolation, offsets, and other
factors, address matched locations should never be used when a high level 
of spatial accuracy (within 200 feet) is required. For example, address 
matched locations should not be overlaid with an urban parcel boundary 
database, since the matched addresses will often fall outside the correct 
parcel boundaries. 

Tools for Producing Partial Matches 

For better or for worse, then, what are the actual mechanisms for 
generating partial matches? They fall into two categories: transformation 
tools and procedural strategies. Transformation tools are applied to the 
target database and/or reference database to convert the address 
information to a form more likely to produce a match. At the lowest level, 
data in the target record's address may have to be reordered, shifted to 
consistent upper case, or purged of commas. Translation tables can be used 
to standardize directional information, street names, and street types. A 
translation table would change "Street," "St," "St.," and "Str" to "St" as 
the standard form. User-modifiable translation tables allow common local
abbreviations ("MLK" for "Martin Luther King Jr.") to be inserted into the
tables provided with the software.
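
A translation table of this kind can be sketched in Python as a simple dictionary lookup applied token by token. The table contents and the function name standardize are assumptions for illustration; they are not taken from any vendor's tables.

```python
# Hypothetical built-in tables, as a package might ship them.
STREET_TYPES = {"STREET": "ST", "ST": "ST", "STR": "ST",
                "AVENUE": "AVE", "AV": "AVE", "DRIVE": "DR"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
              "NORTHEAST": "NE", "NORTHWEST": "NW"}

# User-modifiable additions for common local abbreviations.
LOCAL_NAMES = {"MLK": "MARTIN LUTHER KING JR"}


def standardize(address: str, table: dict) -> str:
    """Upper-case the address, strip commas and periods, and translate each token."""
    tokens = address.upper().replace(",", " ").replace(".", " ").split()
    return " ".join(table.get(token, token) for token in tokens)


table = {**STREET_TYPES, **DIRECTIONS, **LOCAL_NAMES}
standard = standardize("100 MLK Drive", table)
```

Token-by-token translation is deliberately naive: it cannot tell "St" meaning "Saint" from "St" meaning "Street," which is one reason the tables need to be reviewed and adjusted locally.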

The most powerful (and dangerous) transformation tool is the soundex 
function (Knuth 1973). This function creates a general phonetic equivalent
(or soundex) for the written spelling of each street name, allowing the 
software to compare street names based upon their soundex values rather 
than their literal spellings. A simplified soundex function might retain 
the initial letter of the street name, drop all subsequent vowels and 
silent consonants, replace similar consonants with a single equivalent, and 
limit the soundex length to four letters. The street names "Main," "Maine," 
and "Mane" would all have the same soundex: "MN." The names "McDonald,"
"MacDonald," "McDonnell," and "McDonough" would also have the same soundex:
"MKDN." As these examples suggest, the use of a soundex function can 
dramatically increase the address match rate, especially for databases 
whose address source was verbal, or databases in which the address field is 
marred by careless errors of data input and spelling. Use of a soundex
function will also increase the error rate, perhaps very substantially.
Whenever analysts apply an aggressive transformation tool like the soundex
function, they should use additional check fields (such as zip code) and
carefully examine a sample of the resulting soundex-based matches to gauge 
their general reliability. 
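
The simplified function described above can be sketched in Python as follows. This is not the standard Soundex code documented by Knuth (which produces a letter followed by three digits); the consonant groupings and the name simple_soundex are assumptions chosen so that the sketch reproduces the "MN" and "MKDN" examples in the text.

```python
# Letters dropped after the first position: vowels plus the often-silent H and W.
DROPPED = set("AEIOUYHW")

# Hypothetical groups of similar consonants, each mapped to one representative.
GROUPS = {"B": "BFPV", "K": "CGJKQSXZ", "D": "DT",
          "L": "L", "M": "M", "N": "N", "R": "R"}
CODE = {letter: rep for rep, letters in GROUPS.items() for letter in letters}


def simple_soundex(name: str, max_len: int = 4) -> str:
    """Keep the first letter, drop vowels, merge similar consonants, truncate."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    last = CODE.get(letters[0], letters[0])
    for ch in letters[1:]:
        if ch in DROPPED:
            continue
        code = CODE[ch]
        if code != last:        # collapse repeats such as the "nn" in McDonnell
            result.append(code)
        last = code
    return "".join(result)[:max_len]


codes = {n: simple_soundex(n) for n in ("Main", "Maine", "Mane", "McDonald", "McDonough")}
```

That "McDonald" and "McDonough" receive the same code shows both the benefit and the danger: misspellings are recovered, but genuinely different streets become indistinguishable unless a check field separates them.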

Address matching software can also create partial matches through different 
procedural strategies. At present these fall into three major classes: 
criteria relaxation, scoring tables, and probability analysis. Criteria 
relaxation (used by Atlas GIS and MapInfo) requires multiple passes through
the target database with successive loosening of match requirements. The 
first pass might require an exact match of street number, directional 
prefix, street name, street type, directional suffix, and zip code. The 
second pass might ignore directional prefix and directional suffix. A final 
pass might substitute the soundex function for the street name. During each 
pass, the user specifies which criteria are applied, and all applied 
criteria determine whether the match is successful or unsuccessful. 
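
The multi-pass logic can be sketched in Python as shown below. The record layout (dictionaries with number, prefix, name, type, suffix, and zip keys), the function names, and the sample data are all assumptions for illustration; this is not the implementation used by Atlas GIS or MapInfo. A soundex pass is omitted to keep the sketch short.

```python
def agrees(target, segment, criteria):
    """True when every applied criterion is satisfied."""
    for field in criteria:
        if field == "number":
            # The street number must fall inside the segment's address range.
            if not segment["low"] <= target["number"] <= segment["high"]:
                return False
        elif target[field] != segment[field]:
            return False
    return True


def match_with_relaxation(targets, segments, passes):
    """Run successive passes; each pass only sees records still unmatched."""
    results = {}
    unmatched = list(range(len(targets)))
    for pass_no, criteria in enumerate(passes, start=1):
        remaining = []
        for i in unmatched:
            hit = next((s for s in segments if agrees(targets[i], s, criteria)), None)
            if hit is None:
                remaining.append(i)
            else:
                results[i] = (pass_no, hit["id"])
        unmatched = remaining
    return results, unmatched


# Made-up sample data.
segments = [{"id": "seg1", "low": 100, "high": 198, "prefix": "", "name": "DECATUR",
             "type": "RD", "suffix": "NE", "zip": "30030"}]
targets = [
    {"number": 120, "prefix": "", "name": "DECATUR", "type": "RD", "suffix": "NE", "zip": "30030"},
    {"number": 150, "prefix": "", "name": "DECATUR", "type": "RD", "suffix": "", "zip": "30030"},
    {"number": 150, "prefix": "", "name": "PEACHTREE", "type": "RD", "suffix": "", "zip": "30030"},
]
passes = [
    ("number", "prefix", "name", "type", "suffix", "zip"),   # pass 1: exact
    ("number", "name", "type", "zip"),                       # pass 2: ignore directionals
]
results, unmatched = match_with_relaxation(targets, segments, passes)
```

Recording the pass number with each match lets the analyst later sample the looser passes separately when gauging reliability.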

The scoring table approach (used by ARC/INFO release 6 and ARCVIEW
release 1) first applies a soundex algorithm to the street name, then 
generates a list of candidate street segments. Each candidate begins with a 
score of 100, but the GIS deducts a number of user-specified points for 
each nonmatching element. A missing directional suffix or a different 
directional suffix could each have a one point penalty; an exact street 
name match would have no penalty, but a soundex match could have a three 
point penalty. The candidate segment with the highest score would then be 
assigned the match, as long as the score was above a user-specified
minimum. 
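
A minimal Python sketch of the scoring idea follows. It assumes the candidate list has already been produced by a soundex comparison on street name; the penalty for a mismatched street type and all names in the code are hypothetical, and the sketch does not reproduce the ARC/INFO or ARCVIEW implementation.

```python
# User-specified penalties. The suffix and soundex values follow the example
# in the text; the prefix and type values are assumptions.
PENALTIES = {"prefix": 1, "type": 2, "suffix": 1, "soundex_only": 3}


def score(target, candidate, penalties=PENALTIES):
    """Start at 100 and deduct points for each nonmatching element."""
    total = 100
    for field in ("prefix", "type", "suffix"):
        if target[field] != candidate[field]:
            total -= penalties[field]
    if target["name"] != candidate["name"]:
        total -= penalties["soundex_only"]   # matched by soundex, not spelling
    return total


def best_match(target, candidates, minimum, penalties=PENALTIES):
    """Return the highest-scoring candidate if its score is above the minimum."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(target, c, penalties))
    return best if score(target, best, penalties) > minimum else None


# Made-up sample data.
target = {"prefix": "", "name": "MAIN", "type": "ST", "suffix": ""}
candidates = [
    {"id": "a", "prefix": "", "name": "MAINE", "type": "ST", "suffix": ""},
    {"id": "b", "prefix": "", "name": "MAIN", "type": "ST", "suffix": "NE"},
]
chosen = best_match(target, candidates, minimum=90)
```

Here candidate "b" scores 99 (missing suffix) and candidate "a" scores 97 (soundex-only name match), so "b" is chosen; raising the minimum to 99 would leave the record unmatched.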

The statistical probability approach (licensed for future release in 
several commercial GIS packages) creates matches based upon a sophisticated 
general theory of record matching (Jaro 1989, Jaro 1993). (8) To simplify
somewhat, the target database and reference database are both broken
into subsets based upon values of a common field such as zip code. Within
each subset every target record is compared to every reference record.
During this one-to-one comparison, address-related fields of the target
record (street number, directional prefix, etc.) are assigned individual
weights (scores) based upon their similarity to the reference record.
The weights are then summed for a total weight (score) representing the
likelihood that the given target record is located on the particular street
segment. Record comparisons with total weights above a user-specified
threshold value are designated as matches; comparisons with weights
under a second user-specified threshold are nonmatches; and comparisons
with weights between the two thresholds must be reviewed interactively by
the analyst. Remaining nonmatches are broken into subsets by a different
field (such as street name soundex), and the process is repeated.
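
The overall flow (subsetting, field weights, summed score, two thresholds) can be sketched in Python as below. The agreement and disagreement weights are invented for illustration; in the actual method they are estimated from the data, and nothing here reproduces the licensed software. All function names and the record layout are assumptions.

```python
from collections import defaultdict

# Hypothetical (agreement, disagreement) weights per field.
WEIGHTS = {"number": (4.0, -4.0), "name": (6.0, -6.0),
           "type": (1.0, -1.0), "suffix": (1.0, -0.5)}


def field_agrees(field, target, segment):
    if field == "number":
        return segment["low"] <= target["number"] <= segment["high"]
    return target[field] == segment[field]


def total_weight(target, segment):
    """Sum the individual field weights for one target/reference comparison."""
    return sum(agree if field_agrees(f, target, segment) else disagree
               for f, (agree, disagree) in WEIGHTS.items())


def classify(weight, upper, lower):
    if weight > upper:
        return "match"
    if weight < lower:
        return "nonmatch"
    return "review"            # between the thresholds: interactive review


def link(targets, segments, block_field, upper, lower):
    """Compare every target with every reference record in the same subset."""
    blocks = defaultdict(list)
    for s in segments:
        blocks[s[block_field]].append(s)
    decisions = []
    for t in targets:
        for s in blocks.get(t[block_field], []):
            w = total_weight(t, s)
            decisions.append((t["id"], s["id"], w, classify(w, upper, lower)))
    return decisions


# Made-up sample data.
segments = [
    {"id": "r1", "zip": "30030", "low": 100, "high": 198, "name": "DECATUR", "type": "RD", "suffix": "NE"},
    {"id": "r2", "zip": "30030", "low": 100, "high": 198, "name": "DEKALB", "type": "AVE", "suffix": ""},
    {"id": "r3", "zip": "30306", "low": 100, "high": 198, "name": "DECATUR", "type": "RD", "suffix": ""},
]
targets = [{"id": "t1", "zip": "30030", "number": 120, "name": "DECATUR", "type": "RD", "suffix": ""}]
decisions = link(targets, segments, "zip", upper=9.0, lower=1.0)
```

Because segment r3 lies in a different zip code subset it is never compared with t1, which is the saving that subsetting buys and also the reason a second pass on a different field is needed for records with a wrong zip code.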

These three procedural strategies illustrate the classic computing tradeoff 
between ease of use and power. The criteria relaxation method is the 
easiest to understand and use. The statistical probability approach is the 
most powerful, flexible, and statistically valid method. It will usually 
achieve the highest match rate and lowest error rate, but it is also the
most difficult to learn and use. The scoring table method combines good
ease of use with moderate flexibility. At present, the criteria relaxation 
and scoring table methods are best suited for organizations without a 
history of address matching, while the statistical probability approach 
will appeal to experienced analysts who want to produce the best possible 
results. In the future, users can expect several GIS vendors to incorporate 
versions of the statistical probability method, while attempting to 
maintain a reasonable ease of use. 

Address Update Programs 

Over time, new subdivisions appear, street names change, address ranges 
expand or contract. An ongoing address matching program requires updated 
reference files, but this is only the first and least important of reasons 
for local governments to maintain updated, accurate address records. The 
second reason is preparation for the year 2000 census. Governments with 
locally updated TIGER/Line files can share them with the Census Bureau, 
making possible a more accurate (and in many cases, larger) enumeration. 
For cities in which the census undercount is a chronic problem, the 
potential gain from increased revenues and decreased legal costs could 
recoup the initial investment many times over. The final reason for 
updating addresses is the growing demand for accurate address information 
from other units of government. Emergency response (E-911) services are a 
prime example: for them, precise address information is critical to 
protecting lives and property. The U.S. Postal Service is also working to 
extend city-style address ranges into rural areas as part of its route 
restructuring program (Fusaro 1993, LaMacchia 1993).

It makes no sense for many organizations to update separate, inevitably 
conflicting address files for the same area. In most cities and counties 
the local planning department is the natural lead agency to undertake this 
task, since it already oversees the development process and may have 
responsibility for approval of new street names. The planning department 
also usually has good working relationships with other departments of local 
government, regional agencies, and public utilities, all of whom may be 
maintaining extensive internal address databases. 

Table 2 shows the major types of address update activities. (Table 2
omitted) The two rows of the table differentiate the types of addresses
affected (new or existing), and the columns separate activities into
reactive, nonintrusive measures and more aggressive, proactive measures. 

The upper left-hand cell of the table lists nonintrusive activities for 
existing addresses. Any agency that begins address matching will quickly 
find numerous errors and possible errors in even the most recent TIGER/Line 
files, or in any other reference database. If these errors are logged, they 
can then be used to check, correct, and even expand the address ranges and 
other address characteristics. A separate database table of address 
exceptions can contain references to out-of-sequence numbers, parity 
errors, and other exceptional circumstances that do not fit the normal 
expectations for address location. Going still further, an extended alias 
database can be constructed so that the most common forms of misspecified 
addresses can be translated directly into their correct, standard 
equivalents (Hurst 1993).

These activities can all operate within the established record structure of 
the current TIGER/Line files. For new development, however, an agency must
digitize additional street segments with new address ranges, as shown in 
the lower left hand cell of Table 2. In order to maintain compatibility 
with the existing TIGER files, any agency adding new records should closely 
follow the guidelines available from the Census Bureau (U.S. Bureau of the 
Census 1993a). This will ensure that the updated local information is as
useful as possible in preparing for the 2000 census. 

In many jurisdictions, new street names and addresses already must be 
approved by planning departments (Table 2, lower right-hand cell). Because
of the spread of address-dependent E-911 systems, it is becoming more and 
more important for each business or home to have a unique, unambiguous 
address. Since the planning department usually regulates the overall 
development process, it is not difficult to require the department's
approval for new street names and addresses.

The final class of address maintenance activities is the most aggressive 
and potentially the most controversial. In any city's current addressing 
scheme there can be several different names for the same physical street, 
multiple spellings of street names, misordered addresses, address parity 
errors, and different streets with similar or even identical names. The 
visitor to Atlanta, for example, will find more than three dozen streets 
with every possible variation of the name "Peachtree." Not only do these 
problems make it difficult for visitors and citizens to find locations, but 
property and even lives are threatened when police, fire, and ambulance 
services cannot respond quickly to emergency calls. If the number of 
nonstandard addresses is very large, a local government may wish to 
rationalize its addressing scheme by reassigning numerical addresses and 
modifying street names (Eichelberger 1993). The benefits will be
substantial, but so will be the opposition of citizens and businesses who 
understandably prefer their traditional, established addresses. A 
rationalization program can be undertaken only when the advantages 
(especially in terms of emergency service response) are so overwhelming 
that they clearly outweigh the imposed inconvenience of new addresses. For 
most governments, a combination of the other Table 2 activities will 
provide sufficient, inexpensive, and noncontroversial means of maintaining 
an accurate address database. Because of the recent release of the 1992 
TIGER/Line files (in early 1994), this is an excellent time to begin 
address update programs. The longer the delay in beginning a program, the 
more difficult and expensive it will be to implement. 

Microcomputer-based address matching now allows urban planners who deal 
with demographic, social, and economic data to routinely use the same GIS
technology that is revolutionizing environmental analysis, parcel mapping, 
transportation modeling, and infrastructure management. Of course, it would 
be far too much to expect that GIS (or any other computer technology) could 
provide new solutions for the country's urban problems. But perhaps 
planners armed with these newly available visualization and analysis tools 
can begin to develop a deeper understanding of the spatial activity 
patterns that produce poverty, crime, teenage pregnancy, and chronic 
unemployment. We could then target our limited dollars of social service 
spending to reduce needless duplication and ensure that help is nearest at 
hand for those who need it most. This in itself would be a considerable 
achievement.

NOTES 

1. There are many examples of the innovative application of GIS technology 
to urban problems, such as Jimmy Carter's Atlanta Project (Sawicki 1993) 
and the Providence (Rhode Island) Plan (The Providence Plan 1994). Most of 
these applications have been developed by academicians and community 
organizations, not by urban planners within city government. For a brief 
but broad sample of these types of extragovernmental activity, see Urban 
and Regional Information Systems Association (1994), 16-19. 

2. The term "geocoding" can be used as an exact synonym for address 
matching (Cooke 1993), but it can also refer, more broadly, to the general 
process of adding location identifiers such as county names or census tract 
numbers to a database containing information on points, lines, or polygons 
(Huxhold 1991, 319). In this second, broader meaning, geocoding can be 
based upon street addresses, city names, five-digit zip codes, or even 
telephone exchanges.

3. Address matching can also be an invaluable tool for certain types of 
environment, transportation, and infrastructure applications. These GIS 
application areas, however, have other, primary sources of detailed spatial 
data, including existing maps, aerial photography, and satellite imagery. 
Address matching is usually the main source of spatial data for small-area 
demographic, social, and economic analysis. Even U.S. census data for urban
areas, as will be discussed below, is ultimately derived from address 
matching. 

4. Reference databases may also consist of detailed polygon data, such as
parcel boundaries, or point data, such as nine-digit zip locations. In
these cases there is no need for the type of spatial interpolation depicted 
in Figure 2, so the locational information in the reference record is 
simply copied to the target record. Otherwise, the match process is the 
same whether the reference database consists of points, lines, or polygons. 

5. Commercial street-segment databases provide significant advantages 
besides expanded address ranges. These can include more types of features, 
corrected street names and zip codes, additional shape points for better 
geographic representation, greater spatial accuracy, improved connectivity 
of segments within the street network, more consistent classification of 
road types, and direct availability in the internal data format of major 
GIS packages. 

6. If the final two digits of an address's street number are added to its 
nine-digit zip code, the result is an eleven-digit number that should be 
unique for every street address throughout the United States. In the future,
these numbers could be used to develop a nationwide database containing the 
latitude and longitude of every address in the country. 
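
The construction described in this note amounts to appending two digits to the nine-digit zip code, as in the following Python sketch. The function name and the sample zip code are assumptions for illustration.

```python
def eleven_digit_code(zip9: str, street_number: int) -> str:
    """Append the final two digits of the street number to a nine-digit zip code."""
    digits = zip9.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected a nine-digit zip code")
    return digits + f"{street_number % 100:02d}"


code = eleven_digit_code("30030-1234", 1507)
```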

7. The classification is theoretical because an actual classification would 
require complete and perfect information as to which matches were correct 
and which were not. 

8. Matchware is not a GIS, but a generalized record matching program that 
can also be used, for example, to remove duplicate records from databases.

APPENDIX

Datasets and Software Services

Data Services Division
Customer Services
Bureau of the Census
U.S. Department of Commerce
Washington, DC 20233
301-763-4100
TIGER/Line files and Census data CD-ROM disks

U.S. Postal Service
National Customer Support Center
6060 Primacy Parkway STE 201
Memphis, TN 38188-0001
800-331-5746
Zip code address match databases and ZIP/TIGER databases

Environmental Systems Research Institute
380 New York Street
Redlands, CA 92373
714-793-2853
ARCVIEW and ARC/INFO GIS software and spatial datasets

Strategic Mapping Inc.
3135 Kifer Road
Santa Clara, CA 95051
408-970-9600
Atlas GIS software and spatial datasets

MapInfo
200 Broadway
Troy, NY 12180
613-226-8673
MapInfo GIS software and spatial datasets

Geographic Data Technology Inc.
13 Dartmouth College Highway
P.O. Box 377
Lyme, NH 03768
603-795-2183
Dynamap 2000 spatial datasets

Etak, Inc.
1430 O'Brien Drive
Menlo Park, CA 94025
415-328-3825
ETAK spatial datasets

Matchware Technologies Inc.
14637 Locustwood Lane
Silver Spring, MD 20905
301-384-3997
Autostan address standardization software and Automatch record matching

...ways to automate the decision as to whether two records match. One,
the deterministic, decision-table approach, performs a pattern- or
rule-based lookup in a table. The other method, probabilistic...

...compared is evaluated and given a score or letter grade that tells how
well it matched. All the grades are lined up to form a pattern,
maintaining visibility for each field. The pattern is...

...matched to a static table that tells the system whether that particular
configuration of field scores should or should not be matched.

Probabilistic linking also evaluates each field, but the score
numerically represents that field's information content or amount of
information (its emphasis, significance, or usefulness in making a
matching decision). Then, the individual field scores are summed, to
produce a final score precisely measuring the information content of the
fields being compared for a match. That final score, or match
weight, can be converted into an odds ratio for an accurate gauge of the
probability of...

...based on characteristics of the data. For example, the measurement
process will give a higher weight to a match between a pair of Social
Security numbers than it will give to a match between gender indicators
like "M" and "F." It will also give a higher weight to matching rare
values, like the first name Horatio, than it will give to matching common
values...

...phone-line. You can also page any customer, client, employee, or
friend in your contact database with a single mouse click. You can even
have GoldMine page you if you don...

...Merge/Purge Wizard walks you through a potentially complex procedure
that lets you determine the degree of similarity between records,
triggering a purge. You can compare fields by sound, exact match, first
word, or first x number of characters—and then set...

...that is preferred by standard methods, and it brings home the true
dimensionality of the database. In the case of the census database,
there are over 50,000 different words, potentially corresponding to a
50,000-dimensional, binary...

...performing nearest neighbor classification with free text is that the
textual data contained in the fields is not easily comparable. There
are so many different ways of expressing occupations and industries as
phrases that exact...

...with only a slight modification: "Serving food and drinks"). With
numeric fields, a distance or degree of match can be computed even in
the absence of exact match (i.e., we know 100 ... computer languages (*Lisp
for PACE and Fortran for AIOCS). Additionally, each system required the
preclassified database of examples.

The main effort in building an MBR system is deciding how to compare
the various types of fields for similarity. For numerical and logical
fields, one can use statistical methods to decide how...

...see which gave the best classification performance on a test dataset.
For text fields, a weighted match based on the methods used for
text-text comparisons in the SEEKER system [26] could...



6/3,K/4 (Item 4 from file: 275)

DIALOG (R) File 275: Gale Group Computer DB(TM) 
(c) 2005 The Gale Group. All rts. reserv. 

01244206 SUPPLIER NUMBER: 06505431 (USE FORMAT 7 OR 9 FOR FULL TEXT) 

The right job. (Software Review) (career counseling software) (evaluation) 

Krengel, Larry 

Classroom Computer Learning, v8, n7, p16(2)
April, 1988 

DOCUMENT TYPE: evaluation ISSN: 0746-4223 LANGUAGE: ENGLISH 

RECORD TYPE: FULLTEXT; ABSTRACT 

WORD COUNT: 978 LINE COUNT: 00074 

... abilities. To help with this process, The Right Job sets up a very
simple two-column visual comparison, with the student's ideal
conditions listed on the left and the job's actual...

. . .Moving to another disk in the four-disk program, Mike had little 
difficulty using a database search routine to explore jobs that had not
appeared on his previous lists, and to search for jobs that matched 
certain criteria such as training level or minimum salary. 

Still another section of The Right Job sets up a simulation of . . . 
... value = [(increase in R-squared) * (degrees of freedom) / (1 -
R-squared)].sup.1/2

Now compare the t values in column N with those found in a
standard statistical Student's t-distribution table. In the table
you'll find that, with 13 degrees of freedom, a t value of 4.221
corresponds to the highest level of significance. The t value in cell
is 10.40633844, which greatly exceeds 4...
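
As a rough illustration of the formula quoted in this excerpt, the Python sketch below computes a t value from an increase in R-squared. The function name and the sample inputs are hypothetical and do not come from the article; only the formula and the 4.221 threshold for 13 degrees of freedom are taken from the text.

```python
import math

def t_from_r2_increase(r2_increase, degrees_of_freedom, r_squared):
    """t = sqrt(increase in R-squared * degrees of freedom / (1 - R-squared))."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return math.sqrt(r2_increase * degrees_of_freedom / (1.0 - r_squared))

# Hypothetical inputs, for illustration only.
t_value = t_from_r2_increase(r2_increase=0.30, degrees_of_freedom=13, r_squared=0.90)

CRITICAL_T_13_DF = 4.221  # threshold quoted in the text for 13 degrees of freedom
print(t_value > CRITICAL_T_13_DF)
```

With these assumed inputs the quantity under the square root is about 39, so the t value is roughly 6.2 and clears the quoted threshold.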
locating objects and capturing their attributes in the field in
real-time with an astonishing degree of accuracy.

"The available technology has allowed us to create an efficient and 
cost effective product for... 

...G&O Atlanta's Survey Department Head. 

County mapping is loaded into portable computers, enabling field 
crews to locate valves and compare GPS results to supposed locations. 
Once a valve is located, it is then "exercised" - turned. . . 

...to open and close the valve. 

These and other attributes are captured into a GIS database in the
field and the valve is marked with paint to indicate what was done... 
they are not at the zero level, would allow Latin America to
compete on something closer to a level playing field, representing
significant improvement compared to the past. 

While much has been made in the international forum of the search... 

...to define themselves in terms of global markets and not ideologies. 
Copyright 1994 Latin American Database/Latin American Institute
works both on the "front end" on the desktop, and the "back end" on
a database. The index is interfaced with the hospital's registration
system. While a patient is being registered, the index system checks the
registration database. It scans existing registration files by 20
different data fields and uses fuzzy logic--a. . . 

...of mathematical algorithms that identifies information not by
"incorrect" and "correct" values, but by the degree of similarity.
Provider organizations can create algorithms that, for example, recognize 
"Bill" and "William" as similar names... 

. . .dates could be transposed. 

Fuzzy logic works in tandem with other advanced search capabilities 
to compare information in multiple data fields in a master patient 
index's "master file" to fields in an information system's... 
... is correspondence analysis (Ludovic, Morineau, and Warwick, 1984;
Myers, 1996).

In correspondence analysis, a contingency table is first 
established, in this case, 10 automobile categories by 91 programs, with 
the program. . . 

...each cell. A chi-square statistic is then computed for each cell in the 
contingency table based on row (program) and column (product) totals to
compare the expected VCI to the actual. These chi-square statistics are
then the input for... 

...similar to a principal components (factor) analysis. The program 
produces n dimensions and creates the equivalent of a factor score on 
each dimension for each program and product jointly. This score is based on 
the. . . 
episodic occurrence rather than a factor characteristic of a
specific geographic area. The results from Table 1 show that the etch
performance increases from top (acrylic melamine) to bottom (acrylic
urethane. . .

...the weight loss observed in laboratory tests, and the average etch 
rating from the field tests are compared. It is evident that the
relative laboratory performance and field etch ratings show very similar 
trends. A similar plot of the weight loss from laboratory testing 
versus the average field rating is shown in Figure 5. This plot displays 
the degree of correlation between the laboratory results and the field 
results for conventional clearcoat systems cured under nominal... 
least 50 mils of wall loss was measured. These data are used to
represent the field measurements for the detailed comparison. Table 2
presents the distribution of the depths of corrosion for the 2,246
segments.

Performance . . . 

...the depth of corrosion and the axial and circumferential extent of the 
corrosion. Each performance measure, detection, and measurement accuracy
is described. 

There are several reasons for quantifying the performance of an ILI 
survey. First . . . 
... regressions than simple OLS does not appear useful. On the whole,
the estimates of Table 1 suggest that public capital is a fairly
important determinant of growth. The implicit concept of . . . 

...whether the intercept has shown a structural, if unspecified, change. In 
column 1 of Table 2, a time dummy (TIME) is added. As expected, its
coefficient is significantly negative and the (R.sup.2) rises (compare
with column 1 of Table 1). The type of growth observed is endogenous
without convergence. Importantly, public capital is significant and... 

...for which there are enough data points) seems warranted. In view of the 
high degree of (negative) correlation between POPGR and PCY(-1) (0.48,
0.4 and 0.58 in the whole sample.
... the control group. The significance of the overall F test is given
in the penultimate column of the table. Comparisons with the control
group were made using t tests of the parameter estimates. Group sizes... 

...some tests may be based on slightly smaller numbers. CPT = continuous 
performance test; MFF-20 = Matching Familiar Figures test. 

Objective Measures of Overactivity and Inattentive Behavior 
The rates of objective measures of movements are presented in Table 
5. Because these activity measures were counts of movements, tests between 
groups were carried out . . . 
. . . coefficients on the age and gender variables are consistent with
other estimates in the literature.

Table 3 provides a comparison of Ashenfelter and Rouse's (1998)
GLS estimates and those of . . .

...report that the return to schooling is greater in the US than in 
Australia. The correlations between ability and average schooling level 
((gamma)) are very similar across the two studies (compare columns
(i) and (ii) of Table 3). The coefficients on the other common variables
are of the same order of magnitude... 

...schooling variables, the 3SLS method is employed. The results are 
reported in column (iv) of Table 3. Correcting for measurement error 
leads to an estimate of the return to schooling of... 
. . . is captured by the Foster et al. measure, but not by the headcount
ratio.

Interregional comparisons 
Column (b) of Tables 1 and 2 details the interregional 
comparisons of poverty rates for each area type. Figure... 

...summarizes the rankings in a Hasse diagram. Generally, the headcount
ratio and Foster et al. measure give similar rankings via poverty 
orderings. For the four regions, the headcount ratio shows that in 
nonmetropolitan . . . 
... sample) percentage points of the 5.4 percentage points difference
comes from differences in the measured impact of the variables, which
corresponds to slightly less than 90%. The table's diagonal shows the
means of the predicted probabilities for one sample using the coefficients 



...one subsample on each particular subsample. As values of the 
coefficients are kept constant, the comparison of predictions inside one 
column gives a measure in which the asymmetry in unemployment rates arose 
from the different distribution. . . 
... that is closer to Machin and van Reenen's (1993) results.

The final column of Table 1 presents results based on IV estimation 
of the full dataset. Despite the changing sample... 

...coefficient on the import-sales ratio is no longer significant at the 10 
per cent level, the point estimate is similar in magnitude to the
comparable estimate from column 1.1.* 

Two additional specifications not reported confirm the robustness of 
the main results. First... 
... the six domains. Among the 15 ((n * (n - 1))/2) paired
correlations, the average correlation coefficient across all six
measures was only moderate strength (0.24, range 0.00-0.46), indicating
that improvements in the six domains are relatively independent of each
other.

Table 2 presents cumulative gain (area-under-the-curve) comparisons
for the intention to treat analysis... 

...standard deviation (z-score) improvement across all six measures for a 
full year). Clozapine-haloperidol comparisons are summarized in the
seventh column to the right, showing the percentage difference in 
improvement between groups. On the unweighted measure of composite 
effectiveness (upper panel in Table 2), the clozapine group showed 49 
percent greater improvement than the haloperidol group during the...



6/3,K/19 (Item 8 from file: 148) 

DIALOG (R) File 148: Gale Group Trade & Industry DB 
(c)2005 The Gale Group. All rts. reserv. 

10202281 SUPPLIER NUMBER: 20531375 (USE FORMAT 7 OR 9 FOR FULL TEXT) 

The domestic orientation of production and sales by U.S. manufacturing 
affiliates of foreign companies.

Zeile, William J. 

Survey of Current Business, v78, n4, p29(22) 
April, 1998 

ISSN: 0039-6222 LANGUAGE: English RECORD TYPE: Fulltext; Abstract 

WORD COUNT: 11385 LINE COUNT: 01061 

for whom goods intended for further manufacture account for at 
least 50 percent of imports. Table 13 shows the industry-level 
import-share measures for this restricted sample of affiliates (column 4)
in comparison with the measures for all manufacturing affiliates (column 
1); the last two columns show the ratios of these measures to the 
corresponding measure for domestically owned U.S. parent companies. (43) 
In most industries, the import shares for... 
end-line pressure calculated by any of the investigated four equations is
less than the corresponding measured one.

To look for a better accuracy than those obtained by the investigated four
equations.

...to solve the well known Colebrook implicit equation, were checked.(6-9)
(See accompanying box.)

Table 2 presents the obtained errors for these equations as an example of
their accuracies. The. . .ASCII)

Among all the 18 equations evaluated, Panhandle B has been the most accurate
when compared with field data. As highly accurate gas equations mean highly
optimum pipeline design, the need is always . . .
... is expected to reduce the variance of the difference, as suggested
in the introduction.

A comparison across columns in Table 5 shows that even when the 
correlation coefficient is relatively high and cross-equation equality. . . 

...of the joint to independent variance of the difference does decline with 
increases in the level of correlation. Combined, these results suggest
that as correlations rise, the variance of the difference in estimated...
... to prevent hydrate formation.

The temperature sensor gives correct temperature outside the pipe. A 
correlation table was established (through testing) during development of 
the system, and oil temperature inside the pipe is calculated. Both 
temperature values are displayed, and calculated temperature is verified by
comparison with other nearby fields connected to Snorre that have a
temperature sensor installed on the tree. 

Response is good. . . 

...The last 10 (degrees) C up to 45 (degrees) C takes 3-4 hr (45 (degrees)
C outside the pipe is equivalent to 90 (degrees) C inside).
The temperature sensor system consists of three parts: 
* An ROV-installed/retrievable clamp. . . 



the laser-based system were more accurate than those recorded on
the ILI log (Table 1, fifth column). A similar comparison with
manually measured corrosion (Client 3) also showed the laser-based
measurements were more accurate (Table 1, fifth column). The net effect
of this measurement accuracy, and the most significant benefit... 
middle, appears to obviate the need to use any deflator. In this
respect, it is similar to other measures of inequality, such as the
Gini coefficient and the coefficient of variation. The approach is... each
of the corresponding base period segments.

Gregory's key finding is set out in Table 2, page 73, of Gregory 
(1993) where (comparing column 8 with column 7), he concludes that
there has been a proliferation of both high-paying and low. . . 
that organizations can be categorized by their dominant strategy.
Moreover, the other correlations reported in Table 1 are all moderate in 
magnitude. This indicates common method variance is not likely to. . . 

. . .measures of strategy importance, and pay policy measures - appear quite 
independent from each other. The correlations between performance and 
strategy measures range from 0.05 to 0.18, correlations between 
performance and pay policy measures range from -0.11 to 0.18, and those 
between strategy and pay policy measures...

...to 0.20.

Differences in Pay Policies Between High-Performing Organizations with
Different Dominant Strategies.

Table 2 reports means for the ten pay policy measures. Means are 
reported for the whole... 



...are few differences between the subsample of 104 high-performing 
organizations and the whole sample. Compared with the entire sample 
(second column), high-performing organizations (third column) differ only
in that they use a more aggressive pay... 
zero.

Thus far in the econometric analysis we have focused exclusively on 
full-time enrollment. Table 3 displays the results when the log of 
part-time enrollment was regressed on the variables used in Regressions 1 
and 5 in Table 2. For comparative purposes, the first column 
replicates the results when log full-time enrollment is the dependent 
variable.

The table makes clear the reasons for our reluctance to model
"full-time equivalent" enrollment, that is, a weighted average of
full-time and part-time enrollment. 

(TABULAR DATA FOR TABLE 2 OMITTED) 
Table 3 

Model of Log of Part-Time Enrollment (LPT) 

Dependent variable LFT LPT LPT 

Regressors . . . 
... collectivistic culture x = 5.40; F(1,137) = 16.24, p (less than)
.000).

RESULTS 

Table 1 reports the descriptive statistics and zero-order 
correlations among the variables. Correlations among dependent... 

...of simple demography variables (sex, age, and citizenship), as well as 
comparable relational demography variables measuring subjects' 
similarity to others. 

Hypothesis 1 predicted two effects: (1) that cooperative subjects in 
the matched cooperative. . . 

...effects were tested using a priori contrasts comparing the two matched 
groups, (TABULAR DATA FOR TABLE 1 OMITTED) respectively, with each of the 
three other conditions. Each comparison is presented in the last column
of Table 2.

Table 2 shows that cooperative subjects in the collectivistic
culture (group 4) were significantly more cooperative...
... the household files depends on the marketer's business and goals.
Usually, after records are compared using the full data field, they're
assigned to a category based on the degree of similarity. There are
all-equal categories, where every field in the record precisely matches 
every other . . . 

...fields, and separate categories for all possible variations between 
records - rendering each household in the database unique. 

Since the 1970s, householding has been used largely by packaged goods, 
mail order, financial investment and retail banking marketers to add value 
to a database through the grouping of individual files into households. 
The same concept is applied to business... 
... up to three contracts) and long run (all contracts) results are set
out below in Tables 2 and 3, respectively. The first and second columns
of each table show the "before" and "after" comparison of means between 
the RBO cases and the average-. . . 

...indicators: time in negotiations, grievance mediation and arbitration, 
total strikes and lockouts, and total conflict score. Similarly,
columns three and four show the same results for the RBO control group. The 
final two columns reveal the results of comparing the before and after 
mean conflict scores of the RBO sample with those of the... 
or the manager's own (bootstrapping) model. The results of both
approaches are shown in Table 2. Part A compares the average forecast
errors of the managers (column 2) with the... 



...of both a manager/environmental model combination (column 3) and a
manager/bootstrapping model combination (column 5). Comparing columns
2 and 3, we see that all 13 managers are outperformed by combining their 
own . . . 

...managers who are least accurate. However, combining manager and model 
does not result in the level of accuracy that could be achieved by 
simply using the environmental model itself (89 pages). If the... 
...similar to those of the bootstrapping models themselves. 
TABULAR DATA OMITTED 

In Part B of Table 2, the worst-case forecast errors of the managers 
(column 8) are compared with the corresponding errors of both the
manager/environmental model combination (column 9) and the... 
... as probability weights and then applying Huber corrections to the
standard errors (Model 4c in column 3). Comparing columns (2) and (3)
of Table A1, we note first that when an employer announces the intent to
permanently replace striking. . . 

...The coefficients on hires permanent replacements in the linear 
specifications with and without weighting are similar in magnitude. The 
significance level drops from the .02 level (t = 2.35) in the unweighted
version to the .056... 
... an S&L is, the less likely it will become insolvent.

In the last three columns of Table 4, we compare probits in 
which both the WAPM and DEA measures are used as regressors. Again, we only 
report the results for the WAPM measures to conserve space (they are very 
similar to the WACM measures). Introducing the WAPM efficiency measure,
with the proportion of potential comparisons failed, diminishes the 
importance . . . 
... v = 5 and v = 7, and (Gamma)
= 3, 4, and 8. The second panel of Table IV provides the results
supporting the choices of the degrees of freedom.
With the chosen. . . 

...Under the normality assumptions, Gibbons, Ross, and Shanken's (1989) 
exact test is reported in Table V as GRS which has an F distribution with 
degrees of freedom 12 and 107... 

...F distribution. Efficiency is rejected in three of the six subperiods at 
the 5 percent level. To assess the accuracy of our proposed numerical
approach, the same p-value is also computed numerically and reported as
P.sub.0 in the third column of the table. A comparison of P*.sub.0
and P.sub.0 indicates the anticipated accuracy. The numerical errors...
understood, points to the strong misgivings within CMP 6 as to the
technical validity of Table 310-16 in at least some cases. This sentence 
remains today and the panel is... 

. . .be a necessary load cushion in those calculations to offset problems 
with the conventional ampacity tables.
Is there really a problem? 

By way of example, the reader should look at the ampacity of 500kcmil
THW in Table 310-16 and compare it to the Rho 90 column in Table
B-310-7. At first glance, the Appendix table looks far more favorable,
giving 427A instead of 380A. Note, however, that the new table is based
on 20 (degrees) C, and Table 310-16 is based on 30 (degrees) C. The new
table has a footer table that applies a 90% factor on the new table
ampacities if a 30 (degrees) C temperature applies. Comparing apples and
apples, the new ampacity comes out 384A at 30 (degrees) C, very close to
the old table. A similar analysis on the other conductor sizes yields
similar results. We think this lends credence to the new table.

We think that if this column is valid, the others probably are as
well. They. . .
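
The arithmetic behind that comparison can be checked with a short Python sketch. The three figures are the ones quoted in the excerpt (427A, 380A and the 90% factor); the constant and function names are hypothetical labels introduced here for illustration.

```python
# Figures quoted in the excerpt for 500 kcmil THW.
TABLE_310_16_AMPS = 380.0   # Table 310-16, based on 30 degrees C
APPENDIX_AMPS_20C = 427.0   # Table B-310-7, Rho 90 column, based on 20 degrees C
FACTOR_30C = 0.90           # footer-table factor when 30 degrees C applies

def corrected_ampacity(base_amps, factor):
    """Apply a temperature correction factor to a tabulated ampacity."""
    return base_amps * factor

# Put both tables on the same 30 degrees C basis before comparing.
appendix_amps_30c = corrected_ampacity(APPENDIX_AMPS_20C, FACTOR_30C)
difference_pct = 100.0 * (appendix_amps_30c - TABLE_310_16_AMPS) / TABLE_310_16_AMPS
print(f"{appendix_amps_30c:.1f} A at 30 C, {difference_pct:.1f}% above Table 310-16")
```

On the same temperature basis the two tables differ by a little over one percent, which is the sense in which the excerpt calls them very close.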
population growth is comparable to the hypothetical cycles depicted
in Rows 4 and 6 in Table 3. Individuals who are born in a small cohort 
and who observe rising future cohorts... 

...is comparable to the hypothetical cycles depicted in Rows 3 and 5. 

The last three columns in Table 3 compare the level of schooling 
predicted when individuals respond to a given demographic cycle to that... 

...and older workers are imperfect substitutes for one another. In 
addition, they assume that the degree of substitutability is negatively 
correlated with the level of education. They formulate a two-period 
overlapping generations model. In this model workers have... 
... of scores. This procedure resulted in 60 usable records, each one
representing a different trainer.

Table 2 presents the judges' responses to six statements on a
five-point scale ranging from. . . 

...the volunteers offers a more accurate assessment of the volunteers' 
performance. The second and third columns of Table 2 make this 
comparison. The table shows that the judges gave the volunteers positive
ratings, though not quite as high as... 

...their responses could affect future assignments in their counties. The 
average for the professionals was close to the maximum possible score 
of 5.00, while for the volunteers it was near 4.00. But these figures... 
Sum of squared
errors(b) 4,403.1 4,945.3 4,164.8

(a) Comparisons: add consideration set size (column 3) versus
(column 2), F = 5.44; add number of brands (column 3) versus (column...

...half of 1 percent of the households in the panel. The analysis is shown
in Table 3.

The number of brands is correlated with promotion intensity at the 
0.05 level, but the correlation for the number of brands is less than
the correlation for the consideration set size... 
. . . any, of the photoinitiators was faster curing or more efficient
during the UV curing phase.

Table 7 gives data which show the effects of the type of 
photoinitiator and a post... 

. . .post bake process produced a large increase in tensile properties and Tg 
as indicated in Column B. A comparison of Column A with D and F and 
Column B with E and G shows that the. . . 

. . . total cure.

We concluded this part of the project by evaluating the effects of 
the equivalent weight of the starting epoxide resin on cured film 
properties. The data given in Table 8 indicate that the higher 
equivalent weight epoxy resin, Epoxy B, when acrylated to a 1/1 acrylate
to epoxy ratio, gives... 
Making decisions from an interview: Expert measurement and mechanical
...TEXT: correlations, because restriction of range is likely to attenuate
the interview accuracy, and because our database contains only 
interviewees that were subsequently drafted. 

Overview of the Results 

Table 2 presents the measures of accuracy of the various methods. 
Column 2 presents the correlations between predicted scores and the 
criterion. This correlation allows easy comparisons between the 
methods. Columns 3 and 4 present, respectively, the mean standardized 
predicted score of the group that had. . . 

...OPS, is presented in column 5. Like the correlation in column 2, it is a 
measure of the method's accuracy. It is also the basis of the
significance tests for the differences in accuracy between the methods (see
Appendix).

Table 2 indicates that the methods that use criterion information are 
more accurate than the methods... 
...TEXT: significantly correlated with any of the other intensity
variables. While the two human capital intensity measures are highly
correlated, they similarly do not show any consistent and significant
correlation with other intensity variables. The lack of... 

...proxies to capture the characteristics of industries that may benefit 
from the Uruguay Round Agreement. 

Table 6 displays ordinary least squares regression results explaining the 
relationship between cumulative abnormal return and variables representing 
comparative advantage. In columns (1)-(5), each explanatory variable
enters independently. In columns (6) and (7) all intensity measures... 
...TEXT: with no supervisory experience are making slightly less today than
they did two years ago. 

Table 6 compares income with educational field of respondents. 
Although civil and industrial engineers reported the highest median 
salaries and environmental science. . . 

...too small for these groups to draw any concrete conclusions. Most other
respondents with engineering degrees reported similar median salaries 
($77,000 to $79,000). In a break with recent surveys, those with... 
...TEXT: for the computer model.

The results are presented and discussed in the following series of tables,
and those from a typical operation are shown in Figure 1. The top diagram
in. . . 



...concentration in urine. The points in these diagrams show the 
concentrations actually measured in the field, which may be compared
with the predicted values.



Comments on Specific Operations 
Table I, Loader 1 

Benzene in breath was negligible before exposure, the computer model was 
accurate. . . 

...of work. The PBPK model value for phenol in urine was rather higher than 
that measured at the midday break, but accurate at the end of work. It 
underestimated the concentration the following morning, but not if... 
...TEXT: these calculated hours.

This pattern can most readily be seen in the third column of table 1,
where a random error term is added to respondents' self-reported hours. The
mean . . . 

...of this variable is compared with the self-reported measure in the top 
panel of table 1 and compared to the calculated workweek in the bottom 
panel. The distribution of discrepancies... 

. . .measure, including a random term, is the same as the other discrepancy 
results (that is, compare columns 4 and 5). Thus, what appears to be
exaggeration may instead be merely a reflection of the statistical artifact 
of regression to the mean between two measures that are correlated with 
some error. 

In self-reports, the apparent pattern of long, exaggerated working hours
and. . . 
...TEXT: results: the implied annual bias in the CPI is about 1.1
percentage point.

For comparison, the last column of table 1 shows results in which
median real family income each year, calculated from the PSID... 

...and PSID median-income series are displayed in chart 2. The two median 
real-income measures have a correlation of 0.82 in levels, and the 
annual percent changes in these two measures have a correlation of 
0.80. The last column of table 1 reports estimates of equation (1) using 
the percent change in real median family income... 
Political institutions and electric utility investment: A cross-nation
...TEXT: measure of formal constraints on executive discretion (EXECCON)
and democracy (DEMOCR) from the Polity III database as well as the
GASTIL25 index of political and civil rights. These measures, which are
less closely tied to the notion of credible commitment, perform less 
effectively in predicting cross-nation variation... 

...the GASTIL index is lower for countries with more rights, explaining the 
inversion in sign).

Table 2 presents a similar set of analysis for the 38 country sample for 
which data . . . 

...the determination of the risk in investing in the infrastructure of 
developing countries. Indeed, a comparison of columns 1 and 2 suggests 
that judicial tenure tells most of the story told by the... 
Impact on workers of reduced trade barriers: The case of Tunisia and
...TEXT: lower in Germany than they were in Tunisia, though higher in
France and Italy (see table 1). But even this is not an accurate
measure of trade competitiveness because it averages wide variations in 
capital-labour ratios and conceals good productivity performances in 
sectors of comparative advantage, such as apparel and footwear (see table 
2). Ideally, the comparison should be made between workers having the same 
training and the same capital endowment, but the required comparative 
data and field studies are not available. A judgment may be made on the 
basis of activities relocated. . . 
...TEXT: the analysis of the air samples collected during Phase II. These
results are shown in Table IV.

Once the analysis of the air samples was completed, the predicted
concentrations were compared with the measured concentrations. There was
no correlation between the predicted and observed far field 
concentrations although the results were of the same... 

...concentrations ranged from 0.96 to 2.78 mg/m.sup.3. The predicted near
field concentrations were compared with the averages of the two near 
field samples. This comparison showed that the predicted and measured 
values were related. Figure 5 illustrates the relationship bet... 
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Comparative marketing: An interdisciplinary framework for institutional 
analysis 
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...TEXT: reviewed above. A more holistic perspective emerges from the 
argument that context matters. For example, similar degrees of 
calculativeness in individual behavior across various marketing systems may 
not translate into similar performance implications for each system. 

(Table Omitted)

Captioned as: TABLE 2 

The major argument advanced here is that significant variations in 
marketing systems could be... 

...whether differences or similarities exist, but also on explaining such 
differences and similarities. While the field of comparative marketing 
systems may not have advanced to a level that such explanations are 
possible [Boddewyn... 
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...TEXT: of Internet and Web access and use. To figure the effect of 
reweighting the sample, Table 2 reports estimates of Internet and Web 
access and use based on both the Nielsen. . . 

...weights for the four measures of Internet use that were publicly
released. Column (a) in Table 2 shows the original estimates released by 
the CommerceNet Consortium and Nielsen Media Research. These... 

...the sample of North Americans 16 years and older and were weighted only 
by gender. 

(Table Omitted)

Captioned as: Table 7. Activities performed on the Web by Web use 
segments 



The effect of reweighting the CNIDS raw data is shown by comparing
columns (b) and (c) in Table 2. The CNIDS estimates for the U.S., using
Nielsen's weights, are shown in column (b) of Table 2. The corresponding
estimates using the Project 2000 weights appear in column (c) in Table
2. We can also examine the effect of reweighting by studying columns (d) 
and (e) in Table 2 to compare the consistency-corrected estimates of 
Internet and Web access and use using... 
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...TEXT: were collected in the form of frequency counts, row by column (r x
c) contingency tables were used to present the data. In these tables
(Tables 5 to 8), r = 2 and c = 5. The rows represent either two species or
mill-run compared to graded switch-tie cants, or green compared to dry
lumber. The columns correspond to the five grades: SS, No. 1, No. 2,
No. 3, and Below Grade. Each lumber piece in a sample was assigned to the
appropriate cell of a table. We wanted to test the hypothesis that the
probability of being in a given grade... 
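
A minimal sketch of how an r x c table of frequency counts like the one described above might be tested. It assumes Pearson's chi-square statistic; the excerpt is truncated and does not confirm which test the authors used. The function name and the counts in the usage lines are hypothetical.

```python
def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for an r x c table.

    `table` is a list of rows of observed counts. Assumes every row total and
    every column total is positive, so no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column classifications are independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts: 2 rows (e.g. two species) by 5 grade columns
# (SS, No. 1, No. 2, No. 3, Below Grade).
counts = [
    [12, 30, 41, 10, 7],
    [9, 25, 44, 15, 7],
]
stat, dof = chi_square_statistic(counts)
print(f"chi-square = {stat:.3f} on {dof} degrees of freedom")
```

With r = 2 and c = 5 the statistic has (2 - 1)(5 - 1) = 4 degrees of freedom; it would then be compared against the chi-square distribution at the chosen significance level.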
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...TEXT: the last ten years [15, 16]. CA is an exploratory multivariate
technique that converts frequency tables into graphical displays in which 
rows and columns are depicted as points. It provides a method for 
comparing row and column proportions in a two-way or multivariate table.
Mathematically, CA decomposes the chi-square measure of association of
the table into components in a manner similar to that of principal 
component analysis for continuous data... 

...the analysis. The name "correspondence analysis" refers to the fact that 
the row and column scores are reported in corresponding units, which 
permits the portrayal of the points in joint space and facilitates 
interpretation. The... 
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Subsidising consumer services: Effects on employment, welfare and the 
informal economy 

Frederiksen, Niels; Hansen, Peter; Jacobsen, Henrik; Sorensen, Peter 
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...TEXT: elasticity in Denmark. The parameter values derived through this
procedure turned out to imply a level of home production corresponding
to the top figures in the last two columns of Table 2. As already 
suggested, these figures may be seen as a model estimate of that part of 
home production that is a perfect substitute for services supplied from the
market.

Comparing the first and third columns of Table 2, we see that the 
level of activity in the informal economy is much lower. . . 
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Address matching - GIS technology for mapping human activity patterns 
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...TEXT: applied, and all applied criteria determine whether the match is 
successful or unsuccessful. 

The scoring table approach (used by ARC/INFO release 6 and ARCVIEW 
release 1) first applies a soundex... 

...a soundex match could have a three point penalty. The candidate segment 
with the highest score would then be assigned the match, as long as the
score were above a user-specified minimum. 

The statistical probability approach (licensed for future release in... 

...general theory of record matching (Jaro 1989, Jaro 1993). (8) To simplify 
somewhat, the target database and reference database are both broken 
into subsets based upon values of a common field such as zip. . . 

...subset every target record is compared to every reference record. During 
this one-to-one comparison, address-related fields of the target record
(street number, directional prefix, etc.) are assigned individual weights
(scores) based upon their similarity to the reference record. The
weights are then summed for a total weight (score) representing the 
likelihood that the given target... 

...segment. Record comparisons with total weights above a user-specified 
threshold value are designated as matches; comparisons with weights
under a second user-specified threshold are nonmatches; and comparisons 
with weights between the two. . . 
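
The weight-summing and two-threshold classification described above can be sketched as follows. This is an illustrative sketch only, not the ARC/INFO or Jaro implementation: the field names, the agreement and disagreement weights, the exact-equality comparison, and the label "undecided" for the in-between case (which the excerpt truncates) are all assumptions.

```python
# Hypothetical (agreement, disagreement) weights per address field.
FIELD_WEIGHTS = {
    "street_number": (4.0, -3.0),
    "prefix": (1.0, -0.5),
    "street_name": (6.0, -5.0),
    "zip": (3.0, -2.0),
}


def total_weight(target, reference):
    """Sum per-field weights for one target/reference record pair."""
    total = 0.0
    for field, (agree, disagree) in FIELD_WEIGHTS.items():
        # Assumption: a field either agrees exactly or disagrees.
        if target.get(field) == reference.get(field):
            total += agree
        else:
            total += disagree
    return total


def classify(weight, upper, lower):
    """Apply the two user-specified thresholds to a total weight."""
    if weight > upper:
        return "match"
    if weight < lower:
        return "nonmatch"
    return "undecided"  # assumed label; the source text is cut off here


target = {"street_number": "120", "prefix": "N", "street_name": "MAIN", "zip": "10001"}
reference = {"street_number": "120", "prefix": "N", "street_name": "MAIN", "zip": "10002"}
print(classify(total_weight(target, reference), upper=10.0, lower=0.0))
```

In a fuller version the exact-equality test would be replaced by a similarity measure per field, and records would first be grouped into subsets on a common field such as zip so that only records within the same subset are compared.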
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...TEXT: unit price and vice versa. To overcome the interaction effect of 
quantity and unit prices, corresponding weighted unit prices were also 
determined for both day and night projects. The weighted unit prices... 



...an item-by-item comparison of rates has been performed and results are
tabulated in Table 2. The last two columns of the table give the
comparison of actual and weighted means for nighttime and daytime jobs, 
respectively. A negative sign in. . . 

...mean unit prices was also performed and results are tabulated in the
last column of Table 3. 

To quantify the difference of nighttime and daytime unit prices and to 
demonstrate its... 



6/3,K/55 (Item 17 from file: 15) 

DIALOG(R) File 15: ABI/Inform(R)
(c) 2005 ProQuest Info&Learning. All rts. reserv.
00852062 95-01454

Losers and winners in economic growth

Barro, Robert J; Lee, Jong-Wha; Husain, Ishrat; Hart, Gillian
World Bank Research Observer Annual Conference on Development Economics
Supplement PP: 267-314 1993
ISSN: 0257-3032 JRNL CODE: WBA
WORD COUNT: 16605

...TEXT: which the revolution variable is used as its own instrument. 

Columns 4 and 5 of table 5 show the results when the countries with 1965 
GDP per capita below the median... 

...1980 prices) are separated from those above the median. Some differences 
show up from a comparison of the two columns; for example, the
coefficients for schooling and life expectancy variables are larger for the 
poorer. . . 

...the investment ratio and the black-market premium. The most striking 
observation, however, is the degree of similarity between the two sets 
of coefficients, despite the great difference in average levels of real... 
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...TEXT: civilian populations, and in articles versus dissertations. 

Third, as shown in the top row of Table 1, the overall mean absolute 
value correlation corrected for both upward bias and discontinuity (column
5) is .118. This result compares with an overall sample-weighted mean
correlation of .055 (for job proficiency criteria) derivable from Barrick
and Mount's Table 3 (and Ones et al.'s Table 1). After further
correcting for unreliability in both criterion and predictor measures, the 
values are . . . 
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...TEXT: rate for all but one of the years during the period. 

The second pair of columns in Table 5 compares the safety performance 
of Canadian Level 2 carriers and the largest U.S. commuter airlines... 

...fatality and injury rates can be calculated and compared. As can be seen 
in the table, the Canadian carriers have lower fatality rates while the
U.S. carriers have slightly lower serious injury rates. The safety
performance of these two segments appears quite close.

The Level 2 fatality rate is three times the Level 1 rate and the U.S. 
large . . . 

...landing aids. A similar pattern seems to prevail in the Canadian
industry. 

The final three columns of table compare the safety performance of 
the Canadian Level 3 carriers with that of medium size U... 
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...TEXT: data) are involved, a high level of significance obtains at a
relatively low correlation coefficient. 

Tables 5 and 6 report the results of the logistic regressions, taking 
care of the multicollinearity problems. (Tables 5 and 6 omitted) The two
tables are very significant in terms of their goodness of fit to the
overall models, as indicated by their likelihood ratio or chi-square
statistic.



Table 5 provides four different models. Models 5.1 and 5.2 are 
equivalent, except that... 

...5.1 the dummy variables corresponding to the 7-year time frame were 
excluded. A comparison of the two columns suggests that the innovation 
rate is time dependent. The chi-square statistics which are associated... 

...over time, net of other variables. The incremental chi-square is 13.54; 
since its degrees of freedom are six (equivalent to the number of dummy
variables entered into model 5.2), it is significant at...
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...TEXT: impact. (The authors will provide on request complete estimates
including t statistics and control variables.) 

Table 2 also indicates that AFDC-U caseload reductions are three times 
larger than AFDC reductions ... percentage of AFDC heads were exempt. 

B. SELF-SELECTION BIAS 

The last two columns of table 2 provide a sensitivity test for the impact 
estimates. The same estimation model generates these columns, which limit
the comparison group to only those counties that initially applied for 
the workfare demonstration. This procedure reduces... 

...estimated workfare effects remain substantial and conform to the 
anticipated lag structure. Additional tests with matched counties and 
weighted regressions yield similar results. Hence, the general pattern 
of average, differential (AFDC versus AFDC-U), and lagged effects... 
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...TEXT: the employer with whom they interviewed last to allow meaningful 
analysis of job tour information. (Table 1 omitted)

This sample profile was quite similar to data for the institutions sampled 
and. . . 

...U.S. Office of Educational Research, 1989) and 17% reported by AAES. In
addition, the degree major profile was highly similar to the percentage
of degrees awarded by the sample institutions and did not vary from AAES
national percentages by more than 5% in any field. Although these
comparisons do not disprove nonresponse bias, they suggest that gender or 
discipline related effects are not... 
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...TEXT: influenced by the advertiser's "model" of the target group, and 
may, as demonstrated in Table I, influence — and probably also "bias" — the 
target group's evaluations. 

An interesting question is... 

...in the study compare with evaluations made by tourists who have actually 
visited the country? 

Table III shows the mean scores on 14 of the 23 evaluated aspects 
included in the survey among actual tourists (16), the test and control
groups. (Table III omitted) Columns (4) and (5) report deviations in
evaluations between the test group and actual tourists, and control group
and actual tourists respectively. 

Inspection of Table III reveals some very interesting findings. First, 
when comparing columns (1), (2) and (3) it appears that the pattern of 
the scores is very similar for the three groups. This is reflected also 
in correlation coefficients between the scores for actual tourists and 
the test group (r = 0.79, p < 0.001, r(s... 



6/3,K/62 (Item 24 from file: 15) 

00627564 92-42504 

Trading MIPS and Memory for Knowledge Engineering 

Creecy, Robert H.; Masand, Brij M.; Smith, Stephen J.; Waltz, David L.
Communications of the ACM v35n8 PP: 48-64 Aug 1992 
ISSN: 0001-0782 JRNL CODE: ACM 
WORD COUNT: 3749

...TEXT: that is preferred by standard methods, and it brings home the true 
dimensionality of the database. In the case of the census database,
there are over 50,000 different words, potentially corresponding to a
50,000-dimensional, binary...

...performing nearest neighbor classification with free text is that the 
textual data contained in the fields is not easily comparable. There
are so many different ways of expressing occupations and industries as 
phrases that exact... 

...with only a slight modification: "Serving food and drinks"). With 
numeric fields, a distance or degree of match can be computed even in 
the absence of exact match (i.e., we know 100... 
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...TEXT: Bridges function at the Datalink (Layer 2) Level and routers at 
the Network (Layer 3) Level. They provide similar but distinct
functions.

Both link physically separate LANs to enable communications between 
workstations on distinctly. . . 

...layer to permit elementary network partitioning. A bridge looks at the
48-bit destination address field in each packet and compares it with
the addresses in its routing tables: If the address has a local
destination the packet is sent to its intended recipient... 



